Abstract


We consider a class of sparse learning problems in high dimensional feature space regularized by 
a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of 
the features. Such problems often pose a considerable challenge to optimization algorithms due 
to the non-smoothness and non-separability of the regularization term. In this paper, we focus 
on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso 
penalty $l_1/l_2$-norm and the $l_1/l_\infty$-norm. We propose a unified framework based on the augmented
Lagrangian method, under which problems with both types of regularization and their variants 
can be efficiently solved. As one of the core building-blocks of this framework, we develop new 
algorithms using a partial-linearization/splitting technique and prove that the accelerated versions 
of these algorithms require $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution. We compare the
performance of these algorithms against that of the alternating direction augmented Lagrangian 
and FISTA methods on a collection of data sets and apply them to two real-world problems to 
compare the relative merits of the two norms. 

Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable 
splitting, augmented Lagrangian 


1. Introduction 


For feature learning problems in a high-dimensional space, sparsity in the feature vector is usually a 
desirable property. Many statistical models have been proposed in the literature to enforce sparsity, 
dating back to the classical Lasso model ($l_1$-regularization) (Tibshirani, 1996; Chen et al., 1999).
The Lasso model is particularly appealing because it can be solved by very efficient proximal gradient
methods; for example, see Combettes and Pesquet (2011). However, the Lasso does not take into
account the structure of the features (Zou and Hastie, 2005). In many real applications, the features 
in a learning problem are often highly correlated, exhibiting a group structure. Structured sparsity 
has been shown to be effective in those cases. The Group Lasso model (Yuan and Lin, 2006; Bach, 
2008; Roth and Fischer, 2008) assumes disjoint groups and enforces sparsity on the pre-defined 
groups of features. This model has been extended to allow for groups that are hierarchical as well 
as overlapping (Jenatton et al., 2011; Kim and Xing, 2010; Bach, 2010) with a wide array of applications
from gene selection (Kim and Xing, 2010) to computer vision (Huang et al., 2009; Jenatton
et al., 2010). For image denoising problems, extensions with non-integer block sizes and adaptive 
partitions have been proposed by Peyre and Fadili (2011) and Peyre et al. (2011). In this paper, we 
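consider the following basic model of minimizing the squared-error loss with a regularization term
to induce group sparsity:

$$\min_{x \in \mathbb{R}^m} L(x) + \Omega(x), \tag{1}$$

where

$$L(x) = \frac{1}{2}\|Ax - b\|^2, \quad A \in \mathbb{R}^{n \times m},$$

$$\Omega(x) = \begin{cases} \Omega_{l_1/l_2}(x) \equiv \lambda \sum_{s \in S} w_s \|x_s\|, & \text{or} \\ \Omega_{l_1/l_\infty}(x) \equiv \lambda \sum_{s \in S} w_s \|x_s\|_\infty, \end{cases} \tag{2}$$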


$S = \{s_1, \ldots, s_{|S|}\}$ is the set of group indices with $|S| = J$, and the elements (features) in the groups
possibly overlap (Chen et al., 2010; Mairal et al., 2010; Jenatton et al., 2011; Bach, 2010). In this
model, $\lambda$, $w_s$, and $S$ are all pre-defined. $\|\cdot\|$ without a subscript denotes the $l_2$-norm. We note that the
penalty term $\Omega_{l_1/l_2}(x)$ in (2) is different from the one proposed by Jacob et al. (2009),¹ although
both are called overlapping Group Lasso penalties. In particular, (1)-(2) cannot be cast into a
non-overlapping group lasso problem as done by Jacob et al. (2009).


1.1 Related Work 


Two proximal gradient methods have been proposed to solve a close variant of (1) with an $l_1/l_2$
penalty,

$$\min_{x \in \mathbb{R}^m} L(x) + \Omega_{l_1/l_2}(x) + \lambda \|x\|_1, \tag{3}$$


which has an additional $l_1$-regularization term on $x$. Chen et al. (2010) replace $\Omega_{l_1/l_2}(x)$ with a
smooth approximation $\Omega_\eta(x)$ by using Nesterov's smoothing technique (Nesterov, 2005) and solve
the resulting problem by the Fast Iterative Shrinkage Thresholding algorithm (FISTA) (Beck and
Teboulle, 2009). The parameter $\eta$ is a smoothing parameter, upon which the practical and theoretical
convergence speed of the algorithm critically depends. Liu and Ye (2010) also apply FISTA to solve
(3), but in each iteration, they transform the computation of the proximal operator associated with
the combined penalty term into an equivalent constrained smooth problem and solve it by Nesterov's
accelerated gradient descent method (Nesterov, 2005). Mairal et al. (2010) apply the accelerated
proximal gradient method to (1) with the $l_1/l_\infty$ penalty and propose a network flow algorithm to solve
the proximal problem associated with $\Omega_{l_1/l_\infty}(x)$. The method proposed by Mosci et al. (2010) for
solving the Group Lasso problem in Jacob et al. (2009) is in the same spirit as the method of Liu 
and Ye (2010), but their approach uses a projected Newton method. 


1.2 Our Contributions 


We take a unified approach to tackle problem (1) with both $l_1/l_2$- and $l_1/l_\infty$-regularizations. Our
strategy is to develop efficient algorithms based on the Alternating Linearization Method with Skipping
(ALM-S) (Goldfarb et al., 2011) and FISTA for solving an equivalent constrained version
of problem (1) (to be introduced in Section 2) in an augmented Lagrangian method framework.
Specifically, we make the following contributions in this paper:


• We build a general framework based on the augmented Lagrangian method, under which
learning problems with both $l_1/l_2$ and $l_1/l_\infty$ regularizations (and their variants) can be solved.
This framework allows for experimentation with its key building blocks.

• We propose new algorithms: ALM-S with partial splitting (APLM-S) and FISTA with partial
linearization (FISTA-p), to serve as the key building block for this framework. We prove that
APLM-S and FISTA-p have convergence rates of $O(1/k)$ and $O(1/k^2)$ respectively, where $k$ is the
number of iterations. Our algorithms are easy to implement and tune, and they do not require
line-search, eliminating the need to evaluate the objective function at every iteration.

• We evaluate the quality and speed of the proposed algorithms and framework against
state-of-the-art approaches on a rich set of synthetic test data and compare the $l_1/l_2$ and $l_1/l_\infty$ models
on breast cancer gene expression data (Van De Vijver et al., 2002) and a video sequence
background subtraction task (Mairal et al., 2010).

1. This norm has been further investigated and renamed as latent Group Lasso (Obozinski et al., 2011).


2. A Variable-Splitting Augmented Lagrangian Framework 


In this section, we present a unified framework, based on variable splitting and the augmented
Lagrangian method, for solving (1) with both $l_1/l_2$- and $l_1/l_\infty$-regularizations. This framework
reformulates problem (1) as an equivalent linearly-constrained problem, by using the following
variable-splitting procedure.

Let $y \in \mathbb{R}^{\sum_{s \in S}|s|}$ be the vector obtained from the vector $x \in \mathbb{R}^m$ by repeating components of $x$
so that no component of $y$ belongs to more than one group. Let $M = \sum_{s \in S}|s|$. The relationship
between $x$ and $y$ is specified by the linear constraint $Cx = y$, where the $(i,j)$-th element of the matrix
$C \in \mathbb{R}^{M \times m}$ is

$$C_{i,j} = \begin{cases} 1, & \text{if } y_i \text{ is a replicate of } x_j, \\ 0, & \text{otherwise.} \end{cases}$$


For examples of $C$, refer to Chen et al. (2010). Consequently, (1) is equivalent to

$$\min_{x,y} F_{obj}(x,y) \equiv \frac{1}{2}\|Ax - b\|^2 + \tilde\Omega(y) \quad \text{s.t. } Cx = y, \tag{4}$$

where $\tilde\Omega(y)$ is the non-overlapping group-structured penalty term corresponding to $\Omega(y)$ defined in
(2).

Note that $C$ is a highly sparse matrix, and $D = C^T C$ is a diagonal matrix with the diagonal
entries equal to the number of times that each entry of $x$ is included in some group. Problem (4)
now includes two sets of variables $x$ and $y$, where $x$ appears only in the loss term $L(x)$ and $y$ appears
only in the penalty term $\tilde\Omega(y)$.
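To make the variable-splitting construction concrete, the following minimal sketch builds $C$ for a small set of overlapping groups and checks that $C^T C$ is diagonal. The helper name `build_C`, the toy groups, and the dense storage are assumptions made for illustration only; in practice $C$ would be stored as a sparse matrix.

```python
import numpy as np

def build_C(groups, m):
    """Return the M-by-m replication matrix C and the row slice of each group in y.

    groups: list of lists of feature indices (0-based); groups may overlap.
    """
    M = sum(len(g) for g in groups)       # M = sum of group sizes
    C = np.zeros((M, m))
    slices = []
    r = 0
    for g in groups:
        for j in g:
            C[r, j] = 1.0                 # y_r is a replicate of x_j
            r += 1
        slices.append(slice(r - len(g), r))
    return C, slices

# Hypothetical example: m = 4 features, two groups that share feature 1.
groups = [[0, 1], [1, 2, 3]]
C, slices = build_C(groups, m=4)
x = np.array([10.0, 20.0, 30.0, 40.0])
y = C @ x                                 # components of x repeated group by group
D = C.T @ C                               # diagonal of group-membership counts
```

Here feature 1 belongs to both groups, so it appears twice in `y` and the corresponding diagonal entry of `D` is 2.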

All the non-overlapping versions of $\Omega(\cdot)$, including the Lasso and Group Lasso, are special
cases of $\Omega(\cdot)$, with $C = I$, that is, $x = y$. Hence, (4) in this case is equivalent to applying
variable-splitting on $x$. Problems with a composite penalty term, such as the Elastic Net, $\lambda_1\|x\|_1 + \lambda_2\|x\|^2$,
can also be reformulated in a similar way by merging the smooth part of the penalty term ($\lambda_2\|x\|^2$
in the case of the Elastic Net) with the loss function $L(x)$.

To solve (4), we apply the augmented Lagrangian method (Hestenes, 1969; Powell, 1972;
Nocedal and Wright, 1999; Bertsekas, 1999) to it. This method, Algorithm 1, minimizes the augmented
Lagrangian

$$\mathcal{L}(x, y, v) \equiv \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \tilde\Omega(y) \tag{5}$$

exactly for a given Lagrange multiplier $v$ in every iteration, followed by an update to $v$. The parameter
$\mu$ in (5) controls the amount of weight that is placed on violations of the constraint $Cx = y$.
Algorithm 1 can also be viewed as a dual ascent algorithm applied to $P(v) = \min_{x,y} \mathcal{L}(x, y, v)$ (Bertsekas,
1976), where $v$ is the dual variable, $\frac{1}{\mu}$ is the step-length, and $Cx - y$ is the gradient $\nabla_v P(v)$.

Algorithm 1 AugLag
1: Choose $x^0$, $y^0$, $v^0$.
2: for $l = 0, 1, \ldots$ do
3:   $(x^{l+1}, y^{l+1}) \leftarrow \arg\min_{x,y} \mathcal{L}(x, y, v^l)$
4:   $v^{l+1} \leftarrow v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$
5:   Update $\mu$ according to the chosen updating scheme.
6: end for

This algorithm does not require $\mu$ to be very small to guarantee convergence to the solution of problem
(4) (Nocedal and Wright, 1999). However, solving the problem in Line 3 of Algorithm 1 exactly can
be very challenging in the case of structured sparsity. We instead seek an approximate minimizer
of the augmented Lagrangian via the abstract subroutine ApproxAugLagMin(x, y, v). The following
theorem (Rockafellar, 1973) guarantees the convergence of this inexact version of Algorithm 1.


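Theorem 1 Let $\alpha^l := \mathcal{L}(x^l, y^l, v^l) - \inf_{x \in \mathbb{R}^m, y \in \mathbb{R}^M} \mathcal{L}(x, y, v^l)$ and $F^*$ be the optimal value of problem
(4). Suppose problem (4) satisfies the modified Slater's condition, and

$$\sum_{l=1}^{\infty} \sqrt{\alpha^l} < +\infty. \tag{6}$$

Then, the sequence $\{v^l\}$ converges to $v^*$, which satisfies

$$\inf_{x \in \mathbb{R}^m, y \in \mathbb{R}^M} \left( F_{obj}(x, y) - (v^*)^T(Cx - y) \right) = F^*,$$

while the sequence $\{x^l, y^l\}$ satisfies $\lim_{l \to \infty} Cx^l - y^l = 0$ and $\lim_{l \to \infty} F_{obj}(x^l, y^l) = F^*$.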
The condition (6) requires that the augmented Lagrangian subproblem be solved with increasing
accuracy. We formally state this framework in Algorithm 2. We index the iterations of Algorithm
2 by $l$ and call them 'outer iterations'. In Section 3, we develop algorithms that implement
ApproxAugLagMin(x, y, v). The iterations of these subroutines are indexed by $k$ and are called 'inner
iterations'.

Algorithm 2 OGLasso-AugLag
1: Choose $x^0$, $y^0$, $v^0$.
2: for $l = 0, 1, \ldots$ do
3:   $(x^{l+1}, y^{l+1}) \leftarrow$ ApproxAugLagMin($x^l, y^l, v^l$), to compute an approximate minimizer of
     $\mathcal{L}(x, y, v^l)$
4:   $v^{l+1} \leftarrow v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$
5:   Update $\mu$ according to the chosen updating scheme.
6: end for
3. Methods for Approximately Minimizing the Augmented Lagrangian


In this section, we use the overlapping Group Lasso penalty $\Omega(x) = \lambda \sum_{s \in S} w_s\|x_s\|$ to illustrate the
optimization algorithms under discussion. The case of $l_1/l_\infty$-regularization will be discussed in
Section 4. From now on, we assume without loss of generality that $w_s = 1$ for every group $s$.


3.1 Alternating Direction Augmented Lagrangian (ADAL) Method 


The well-known Alternating Direction Augmented Lagrangian (ADAL) method (Eckstein and
Bertsekas, 1992; Gabay and Mercier, 1976; Glowinski and Marroco, 1975; Boyd et al., 2010)²
approximately minimizes the augmented Lagrangian by minimizing (5) with respect to $x$ and $y$
alternatingly and then updates the Lagrange multiplier $v$ on each iteration (e.g., see Bertsekas and
Tsitsiklis, 1989, Section 3.4). Specifically, the single-iteration procedure that serves as the procedure
ApproxAugLagMin(x, y, v) is given below as Algorithm 3.


Algorithm 3 ADAL
1: Given $x^l$, $y^l$, and $v^l$.
2: $x^{l+1} \leftarrow \arg\min_x \mathcal{L}(x, y^l, v^l)$
3: $y^{l+1} \leftarrow \arg\min_y \mathcal{L}(x^{l+1}, y, v^l)$
4: return $x^{l+1}$, $y^{l+1}$.








The ADAL method, also known as the alternating direction method of multipliers (ADMM)
and the split Bregman method, has recently been applied to problems in signal and image processing
(Combettes and Pesquet, 2011; Afonso et al., 2009; Goldstein and Osher, 2009) and low-rank
matrix recovery (Lin et al., 2010). Its convergence has been established by Eckstein and Bertsekas
(1992). This method can accommodate a sum of more than two functions. For example, by applying
variable-splitting (e.g., see Bertsekas and Tsitsiklis, 1989; Boyd et al., 2010) to the problem
$\min_x f(x) + \sum_{i=1}^{K} g_i(C_i x)$, it can be transformed into

$$\min_{x, y_1, \ldots, y_K} f(x) + \sum_{i=1}^{K} g_i(y_i) \quad \text{s.t. } y_i = C_i x, \quad i = 1, \ldots, K.$$


The subproblems corresponding to the $y_i$'s can thus be solved simultaneously by the ADAL method.
This so-called simultaneous direction method of multipliers (SDMM) (Setzer et al., 2010) is related
to Spingarn's method of partial inverses (Spingarn, 1983) and has been shown to be a special
instance of a more general parallel proximal algorithm with inertia parameters (Pesquet and Pustelnik,
2010).

2. Recently, Mairal et al. (2011) also applied ADAL with two variants based on variable-splitting to the overlapping
Group Lasso problem.

Note that the problem solved in Line 3 of Algorithm 3,

$$y^{l+1} = \arg\min_y \mathcal{L}(x^{l+1}, y, v^l) \equiv \arg\min_y \left\{ \frac{1}{2\mu}\|d^l - y\|^2 + \tilde\Omega(y) \right\}, \tag{7}$$

where $d^l = Cx^{l+1} - \mu v^l$, is group-separable and hence can be solved in parallel. As in Qin et al.
(2010), each subproblem can be solved by applying the block soft-thresholding operator,
$T(d_s^l, \mu\lambda) \equiv \frac{d_s^l}{\|d_s^l\|} \max(0, \|d_s^l\| - \lambda\mu)$, $s = 1, \ldots, J$. Solving for $x^{l+1}$ in Line 2 of Algorithm 3, that is,

$$x^{l+1} = \arg\min_x \mathcal{L}(x, y^l, v^l) \equiv \arg\min_x \left\{ \frac{1}{2}\|Ax - b\|^2 - (v^l)^T Cx + \frac{1}{2\mu}\|Cx - y^l\|^2 \right\}, \tag{8}$$

involves solving the linear system

$$\left(A^T A + \frac{1}{\mu} D\right) x = A^T b + C^T v^l + \frac{1}{\mu} C^T y^l, \tag{9}$$

where the matrix on the left hand side of (9) has dimension $m \times m$. Many real-world data sets, such
as gene expression data, are highly under-determined. Hence, the number of features ($m$) is much
larger than the number of samples ($n$). In such cases, one can use the Sherman-Morrison-Woodbury
formula,

$$\left(A^T A + \frac{1}{\mu} D\right)^{-1} = \mu D^{-1} - \mu^2 D^{-1} A^T \left(I + \mu A D^{-1} A^T\right)^{-1} A D^{-1},$$

and solve instead an $n \times n$ linear system involving the matrix $I + \mu A D^{-1} A^T$. In addition, as long as
$\mu$ stays the same, one has to factorize $A^T A + \frac{1}{\mu} D$ or $I + \mu A D^{-1} A^T$ only once and store their factors
for subsequent iterations.
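The Sherman-Morrison-Woodbury route can be sketched as follows. This is an illustrative NumPy version only: the function name `solve_x_smw` and its argument layout are assumptions, and it calls a dense solver on every invocation instead of caching a factorization as described above.

```python
import numpy as np

def solve_x_smw(A, d, mu, rhs):
    """Solve (A^T A + D / mu) x = rhs through the n-by-n matrix I + mu * A D^{-1} A^T.

    d is the vector of diagonal entries of D = C^T C (all entries positive).
    """
    n = A.shape[0]
    ADinv = A / d                               # A D^{-1}: column j scaled by 1 / d_j
    small = np.eye(n) + mu * (ADinv @ A.T)      # I + mu * A D^{-1} A^T  (n-by-n)
    t = rhs / d                                 # D^{-1} rhs
    w = np.linalg.solve(small, A @ t)           # small^{-1} A D^{-1} rhs
    return mu * t - mu**2 * (ADinv.T @ w)       # apply the formula term by term
```

For fixed $\mu$, the matrix `small` does not change between iterations, so a stored Cholesky factor could replace the call to `np.linalg.solve`.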

When both $n$ and $m$ are very large, it might be infeasible to compute or store $A^T A$, not to
mention its eigen-decomposition, or the Cholesky decomposition of $A^T A + \frac{1}{\mu} D$. In this case, one
can solve the linear systems using the preconditioned Conjugate Gradient (PCG) method (Golub
and Van Loan, 1996). Similar comments apply to the other algorithms proposed in Sections 3.2 -
3.4 below. Alternatively, we can apply FISTA to Line 3 in Algorithm 2 (see Section 3.5).
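Putting Algorithm 2 and Algorithm 3 together, a minimal ADAL loop for the $l_1/l_2$ penalty might look like the sketch below. It assumes $w_s = 1$, a fixed $\mu$, a fixed iteration count in place of the stopping criteria of Section 5.1, and a dense solve of (9); the names `block_soft_threshold` and `adal` are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def block_soft_threshold(d, slices, thresh):
    """Apply T(d_s, thresh) = d_s / ||d_s|| * max(0, ||d_s|| - thresh) to each group."""
    y = np.zeros_like(d)
    for s in slices:
        norm = np.linalg.norm(d[s])
        if norm > thresh:
            y[s] = d[s] * (1.0 - thresh / norm)
    return y

def adal(A, b, C, slices, lam, mu=0.1, iters=500):
    """One ADAL pass (Algorithm 3) per outer iteration of Algorithm 2, with mu held fixed."""
    M, m = C.shape
    x, y, v = np.zeros(m), np.zeros(M), np.zeros(M)
    H = A.T @ A + (C.T @ C) / mu                  # left-hand side of (9)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(H, Atb + C.T @ v + C.T @ y / mu)         # x-update, (9)
        y = block_soft_threshold(C @ x - mu * v, slices, lam * mu)   # y-update, (7)
        v = v - (C @ x - y) / mu                                     # multiplier update
    return x, y, v
```

With $C = I$ and singleton groups the problem reduces to the Lasso, which gives a convenient sanity check: for $A = I$ the solution is the soft-thresholding of $b$ at level $\lambda$.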


3.2 ALM-S: partial split (APLM-S) 


We now consider applying the Alternating Linearization Method with Skipping (ALM-S) from
Goldfarb et al. (2011) to approximately minimize (5). In particular, we apply variable splitting
(Section 2) to the variable $y$, to which the group-sparse regularizer $\tilde\Omega$ is applied (the original ALM-S
splits both variables $x$ and $y$), and re-formulate (5) as follows.

$$\min_{x, y, \bar y} \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \tilde\Omega(\bar y) \quad \text{s.t. } y = \bar y. \tag{10}$$


Note that the Lagrange multiplier $v$ is fixed here. Defining

$$f(x, y) := \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2, \tag{11}$$

$$g(\bar y) = \tilde\Omega(\bar y) = \lambda \sum_s \|\bar y_s\|, \tag{12}$$

problem (10) is of the form

$$\min f(x, y) + g(\bar y) \quad \text{s.t. } y = \bar y, \tag{13}$$

to which we now apply partial-linearization.

3.2.1 PARTIAL LINEARIZATION AND CONVERGENCE RATE ANALYSIS


Let us define

$$F(x, y) := f(x, y) + g(y) = \mathcal{L}(x, y; v),$$

$$\mathcal{L}_\rho(x, y, \bar y, \gamma) := f(x, y) + g(\bar y) + \gamma^T(\bar y - y) + \frac{1}{2\rho}\|\bar y - y\|^2, \tag{14}$$

where $\gamma$ is the Lagrange multiplier in the augmented Lagrangian (14) corresponding to problem (13).
We now present our partial-split alternating linearization algorithm to implement
ApproxAugLagMin(x, y, v) in Algorithm 2.


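Algorithm 4 APLM-S
1: Given $x^0$, $\bar y^0$, $v$. Choose $\rho$, $\gamma^0$, such that $-\gamma^0 \in \partial g(\bar y^0)$. Define $f(x, y)$ as in (11).
2: for $k = 0, 1, \ldots$ until stopping criterion is satisfied do
3:   $(x^{k+1}, y^{k+1}) \leftarrow \arg\min_{x,y} \mathcal{L}_\rho(x, y, \bar y^k, \gamma^k)$.
4:   if $F(x^{k+1}, y^{k+1}) > \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar y^k, \gamma^k)$ then
5:     $y^{k+1} \leftarrow \bar y^k$
6:     $x^{k+1} \leftarrow \arg\min_x f(x, y^{k+1}) \equiv \arg\min_x \mathcal{L}_\rho(x, y^{k+1}, \bar y^k, \gamma^k)$
7:   end if
8:   $\bar y^{k+1} \leftarrow p_f(x^{k+1}, y^{k+1}) \equiv \arg\min_{\bar y} \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar y, \nabla_y f(x^{k+1}, y^{k+1}))$
9:   $\gamma^{k+1} \leftarrow \nabla_y f(x^{k+1}, y^{k+1}) - \frac{y^{k+1} - \bar y^{k+1}}{\rho}$
10: end for
11: return $(x^{K+1}, \bar y^{K+1})$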





We note that in Line 6 in Algorithm 4,

$$x^{k+1} = \arg\min_x \mathcal{L}_\rho(x, y^{k+1}, \bar y^k, \gamma^k) \equiv \arg\min_x f(x, y^{k+1}) \equiv \arg\min_x f(x, \bar y^k). \tag{15}$$
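Now, we have a variant of Lemma 2.2 in Goldfarb et al. (2011).

Lemma 1 For any $(x, y)$, if $\bar q := \arg\min_{\bar y} \mathcal{L}_\rho(x, y, \bar y, \nabla_y f(x, y))$, and

$$F(x, \bar q) \le \mathcal{L}_\rho(x, y, \bar q, \nabla_y f(x, y)), \tag{16}$$

then for any $(\bar x, \bar y)$,

$$2\rho\left(F(\bar x, \bar y) - F(x, \bar q)\right) \ge \|\bar q - \bar y\|^2 - \|y - \bar y\|^2 + 2\rho\left((\bar x - x)^T \nabla_x f(x, y)\right). \tag{17}$$

Similarly, for any $\bar y$, if $(p, q) := \arg\min_{x,y} \mathcal{L}_\rho(x, y, \bar y, -\gamma_g(\bar y))$, where $\gamma_g(\bar y)$ is a sub-gradient of $g$ at $\bar y$, and

$$F(p, q) \le \mathcal{L}_\rho((p, q), \bar y, -\gamma_g(\bar y)), \tag{18}$$

then for any $(x, y)$,

$$2\rho\left(F(x, y) - F(p, q)\right) \ge \|q - y\|^2 - \|\bar y - y\|^2. \tag{19}$$

Proof See Appendix A. ∎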


Algorithm 4 checks condition (18) at Line 4 because the function $g$ is non-smooth and condition
(18) may not hold no matter what the value of $\rho$ is. When this condition is violated, a skipping step
occurs in which the value of $y$ is set to the value of $\bar y$ in the previous iteration (Line 5) and $\mathcal{L}_\rho$
is re-minimized with respect to $x$ (Line 6) to ensure convergence. Let us define a regular iteration of
Algorithm 4 to be an iteration where no skipping step occurs, that is, Lines 5 and 6 are not executed.
Likewise, we define a skipping iteration to be an iteration where a skipping step occurs. Now, we
are ready to state the iteration complexity result for APLM-S.


Theorem 2 Assume that $\nabla_y f(x, y)$ is Lipschitz continuous in $y$ with Lipschitz constant $L_y(f)$, that
is, for any $x$, $\|\nabla_y f(x, y) - \nabla_y f(x, z)\| \le L_y(f)\|y - z\|$, for all $y$ and $z$. For $\rho \le \frac{1}{L_y(f)}$, the iterates
$(x^k, \bar y^k)$ in Algorithm 4 satisfy

$$F(x^k, \bar y^k) - F(x^*, y^*) \le \frac{\|\bar y^0 - y^*\|^2}{2\rho(k + k_n)}, \quad \forall k, \tag{20}$$

where $(x^*, y^*)$ is an optimal solution to (10), and $k_n$ is the number of regular iterations among the
first $k$ iterations.

Proof See Appendix B. ∎

Remark 1 For Theorem 2 to hold, we need $\rho \le \frac{1}{L_y(f)}$. From the definition of $f(x, y)$ in (11), it is easy
to see that $L_y(f) = \frac{1}{\mu}$ regardless of the loss function $L(x)$. Hence, we set $\rho = \mu$, so that condition
(16) in Lemma 1 is satisfied.
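The value of $L_y(f)$ quoted in Remark 1 can be checked in one line from (11). The short derivation below is added for convenience and uses only the definitions above:

$$\nabla_y f(x, y) = v - \frac{1}{\mu}(Cx - y) \quad \Longrightarrow \quad \nabla_y f(x, y) - \nabla_y f(x, z) = \frac{1}{\mu}(y - z),$$

so $\|\nabla_y f(x, y) - \nabla_y f(x, z)\| = \frac{1}{\mu}\|y - z\|$ for every $x$. The loss term $\frac{1}{2}\|Ax - b\|^2$ does not depend on $y$ and therefore does not enter the constant.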


In Section 3.3, we will discuss the case where the iterations entirely consist of skipping steps.
We will show that this is equivalent to ISTA (Beck and Teboulle, 2009) with partial linearization as
well as a variant of ADAL. In this case, the inner Lagrange multiplier $\gamma$ is redundant.


3.2.2 SOLVING THE SUBPROBLEMS 


We now show how to solve the subproblems in Algorithm 4. First, observe that since $\rho = \mu$,

$$\arg\min_{\bar y} \mathcal{L}_\rho(x, y, \bar y, \nabla_y f(x, y)) \equiv \arg\min_{\bar y} \left\{ \nabla_y f(x, y)^T \bar y + \frac{1}{2\mu}\|\bar y - y\|^2 + g(\bar y) \right\} \equiv \arg\min_{\bar y} \left\{ \frac{1}{2\mu}\|d - \bar y\|^2 + \lambda \sum_s \|\bar y_s\| \right\},$$

where $d = Cx - \mu v$. Hence, $\bar y$ can be obtained by applying the block soft-thresholding operator
$T(d_s, \mu\lambda)$ as in Section 3.1. Next consider the subproblem

$$\min_{(x, y)} \mathcal{L}_\rho(x, y, \bar y, \gamma) \equiv \min_{(x, y)} \left\{ f(x, y) + \gamma^T(\bar y - y) + \frac{1}{2\mu}\|\bar y - y\|^2 \right\}. \tag{21}$$

It is easy to verify that solving the linear system given by the optimality conditions for (21) by block
Gaussian elimination yields the system

$$\left(A^T A + \frac{1}{2\mu} D\right) x = r_x + \frac{1}{2} C^T r_y$$

for computing $x$, where $r_x = A^T b + C^T v$ and $r_y = -v + \gamma + \frac{\bar y}{\mu}$. Then $y$ can be computed as
$\left(\frac{\mu}{2}\right)\left(r_y + \frac{1}{\mu} Cx\right)$.

As in Section 3.1, only one Cholesky factorization of $A^T A + \frac{1}{2\mu} D$ is required for each invocation
of Algorithm 4. Hence, the amount of work involved in each iteration of Algorithm 4 is comparable
to that of an ADAL iteration.
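The reduced system for (21) can be written out as a short sketch. It assumes $\rho = \mu$ as in Remark 1 and uses a dense solve on each call; the function name `solve_xy_subproblem` and its argument order are hypothetical.

```python
import numpy as np

def solve_xy_subproblem(A, b, C, v, y_bar, gamma, mu):
    """Minimize f(x, y) + gamma^T (y_bar - y) + ||y_bar - y||^2 / (2 mu) over (x, y).

    Assumes rho = mu. Eliminating y leaves an m-by-m system in x alone.
    """
    D = C.T @ C
    r_x = A.T @ b + C.T @ v
    r_y = -v + gamma + y_bar / mu
    x = np.linalg.solve(A.T @ A + D / (2.0 * mu), r_x + 0.5 * (C.T @ r_y))
    y = (mu / 2.0) * (r_y + C @ x / mu)           # back-substitution for y
    return x, y
```

A quick way to validate such a routine is to check that both partial gradients of the objective in (21) vanish at the returned pair.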

It is straightforward to derive an accelerated version of Algorithm 4, which we shall refer to as
FAPLM-S, that corresponds to a partial-split version of the FALM algorithm proposed by Goldfarb
et al. (2011) and also requires $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution. In Section 3.4,
we present an algorithm FISTA-p, which is a special version of FAPLM-S in which every iteration
is a skipping iteration and which has a much simpler form than FAPLM-S, while having essentially
the same iteration complexity.

It is also possible to apply ALM-S directly, which splits both $x$ and $y$, to solve the augmented
Lagrangian subproblem. Similar to (10), we reformulate (5) as

$$\min_{(x, y), (\bar x, \bar y)} \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \lambda \sum_s \|\bar y_s\| \quad \text{s.t. } x = \bar x, \ y = \bar y. \tag{22}$$

The functions $f$ and $g$ are defined as in (11) and (12), except that now we write $g$ as $g(\bar x, \bar y)$ even
though the variable $\bar x$ does not appear in the expression for $g$. It can be shown that $\bar y$ admits exactly
the same expression as in APLM-S, whereas $\bar x$ is obtained by a gradient step, $x - \rho \nabla_x f(x, y)$. To
obtain $x$, we solve the linear system

$$\left(A^T A + \frac{1}{\mu + \rho} D + \frac{1}{\rho} I\right) x = r_x + \frac{\rho}{\mu + \rho} C^T r_y, \tag{23}$$

after which $y$ is computed by $y = \left(\frac{\mu\rho}{\mu + \rho}\right)\left(r_y + \frac{1}{\mu} Cx\right)$.


Remark 2 For ALM-S, the Lipschitz constant for $\nabla f(x, y)$ is $L_f = \lambda_{\max}(A^T A) + \frac{1}{\mu} d_{\max}$, where
$d_{\max} = \max_i D_{ii} \ge 1$. For the complexity results in Goldfarb et al. (2011) to hold, we need $\rho \le \frac{1}{L_f}$. Since
$\lambda_{\max}(A^T A)$ is usually not known, it is necessary to perform a backtracking line-search on $\rho$ to ensure
that $F(x^{k+1}, y^{k+1}) \le \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar x^k, \bar y^k, \gamma^k)$. In practice, we adopted the following continuation
scheme instead. We initially set $\rho = \rho_0 = \frac{\mu}{d_{\max}}$ and decreased $\rho$ by a factor of $\beta$ after a given number
of iterations until $\rho$ reached a user-supplied minimum value $\rho_{\min}$. This scheme prevents $\rho$ from being
too small, and hence negatively impacting computational performance. However, in both cases the
left-hand-side of the system (23) has to be re-factorized every time $\rho$ is updated.


As we have seen above, the Lipschitz constant resulting from splitting both $x$ and $y$ is potentially
much larger than $\frac{1}{\mu}$. Hence, partial-linearization reduces the Lipschitz constant, which improves
the bound on the right-hand-side of (20) and allows Algorithm 4 to take larger step sizes (equal to
$\mu$). Compared to ALM-S, solving for $x$ in the skipping step (Line 6) becomes harder. Intuitively,
APLM-S does a better job of 'load-balancing' by managing a better trade-off between the hardness
of the subproblems and the practical convergence rate.


3.3 ISTA: Partial Linearization (ISTA-p) 


We can also minimize the augmented Lagrangian (5), which we write as $\mathcal{L}(x, y, v) = f(x, y) + g(y)$
with $f(x, y)$ and $g(y)$ defined as in (11) and (12), using a variant of ISTA that only linearizes $f(x, y)$
with respect to the $y$ variables. As in Section 3.2, we can set $\rho = \mu$ and guarantee the convergence
properties of ISTA-p (see Corollary 1 below). Formally, let $(x, y)$ be the current iterate and $(x^+, y^+)$
be the next iterate. We compute $y^+$ by

$$y^+ = \arg\min_{y'} \mathcal{L}_\rho(x, y, y', \nabla_y f(x, y)) \equiv \arg\min_{y'} \left\{ \frac{1}{2\mu}\|y' - d_y\|^2 + \lambda \sum_j \|y'_j\| \right\}, \tag{24}$$

where $d_y = Cx - \mu v$. Hence the solution $y^+$ to problem (24) is given blockwise by $T([d_y]_j, \mu\lambda)$,
$j = 1, \ldots, J$.

Now given $y^+$, we solve for $x^+$ by

$$x^+ = \arg\min_{x'} f(x', y^+) \equiv \arg\min_{x'} \left\{ \frac{1}{2}\|Ax' - b\|^2 - v^T(Cx' - y^+) + \frac{1}{2\mu}\|Cx' - y^+\|^2 \right\}. \tag{25}$$

The algorithm that implements subroutine ApproxAugLagMin(x, y, v) in Algorithm 2 by ISTA with
partial linearization is stated below as Algorithm 5.


Algorithm 5 ISTA-p (partial linearization)
1: Given $x^0$, $\bar y^0$, $v$. Choose $\rho$. Define $f(x, y)$ as in (11).
2: for $k = 0, 1, \ldots$ until stopping criterion is satisfied do
3:   $x^{k+1} \leftarrow \arg\min_x f(x, \bar y^k)$
4:   $\bar y^{k+1} \leftarrow \arg\min_y \mathcal{L}_\rho(x^{k+1}, \bar y^k, y, \nabla_y f(x^{k+1}, \bar y^k))$
5: end for
6: return $(x^{K+1}, \bar y^{K+1})$





As we remarked in Section 3.2, Algorithm 5 is equivalent to Algorithm 4 (APLM-S) where
every iteration is a skipping iteration. Hence, we have the following corollary from Theorem 2.

Corollary 1 Assume $\nabla_y f(\cdot, \cdot)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$. For $\rho \le \frac{1}{L_y(f)}$,
the iterates $(x^k, \bar y^k)$ in Algorithm 5 satisfy

$$F(x^k, \bar y^k) - F(x^*, y^*) \le \frac{\|\bar y^0 - y^*\|^2}{2\rho k}, \quad \forall k,$$

where $(x^*, y^*)$ is an optimal solution to (10).

It is easy to see that (24) is equivalent to (7), and that (25) is the same as (8) in ADAL.


Remark 3 We have shown that with a fixed v, the ISTA-p iterations are exactly the same as the 
ADAL iterations. The difference between the two algorithms is that ADAL updates the (outer) 
Lagrange multiplier v in each iteration, while in ISTA-p, v stays the same throughout the inner 
iterations. We can thus view ISTA-p as a variant of ADAL with delayed updating of the Lagrange 
multiplier. 


The ‘load-balancing’ behavior discussed in Section 3.2 is more obvious for ISTA-p. As we will 
see in Section 3.5, if we apply ISTA (with full linearization) to minimize (5), solving for x is simply 
a gradient step. Here, we need to minimize f(x,y) with respect to x exactly, while being able to take 
larger step sizes in the other subproblem, due to the smaller associated Lipschitz constant. 


3.4 FISTA-p 


We now present an accelerated version FISTA-p of ISTA-p. FISTA-p is a special case of FAPLM-S
with a skipping step occurring in every iteration. We state the algorithm formally as Algorithm 6.
The iteration complexity of FISTA-p (and FAPLM-S) is given by the following theorem.


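Algorithm 6 FISTA-p (partial linearization)
1: Given $x^0$, $\bar y^0$, $v$. Choose $\rho$, and $z^0 = \bar y^0$. Define $f(x, y)$ as in (11).
2: for $k = 0, 1, \ldots, K$ do
3:   $x^{k+1} \leftarrow \arg\min_x f(x, z^k)$
4:   $\bar y^{k+1} \leftarrow \arg\min_y \mathcal{L}_\rho(x^{k+1}, z^k, y, \nabla_y f(x^{k+1}, z^k))$
5:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$
6:   $z^{k+1} \leftarrow \bar y^{k+1} + \left(\frac{t_k - 1}{t_{k+1}}\right)(\bar y^{k+1} - \bar y^k)$
7: end for
8: return $(x^{K+1}, \bar y^{K+1})$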





Theorem 3 Assuming that $\nabla_y f(\cdot)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$ and
$\rho \le \frac{1}{L_y(f)}$, the sequence $\{x^k, \bar y^k\}$ generated by Algorithm 6 satisfies

$$F(x^k, \bar y^k) - F(x^*, y^*) \le \frac{2\|\bar y^0 - y^*\|^2}{\rho (k + 1)^2}.$$

Although we need to solve a linear system in every iteration of Algorithms 4, 5, and 6, the
left-hand-side of the system stays constant throughout the invocation of the algorithms because,
following Remark 1, we can always set $\rho = \mu$. Hence, no line-search is necessary, and this step
essentially requires only one backward- and one forward-substitution, the complexity of which is
the same as a gradient step.
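The inner loop of Algorithm 6 can be sketched in a few lines. The version below is illustrative only: it assumes $\rho = \mu$, $w_s = 1$, a fixed number of iterations, a dense solve in place of a cached factorization, and the standard FISTA initialization $t = 1$, which is an assumption of this sketch. The name `fista_p` and its signature are hypothetical.

```python
import numpy as np

def fista_p(A, b, C, slices, v, lam, mu, y0, iters=200):
    """FISTA with partial linearization for min_{x,y} f(x, y) + lam * sum_s ||y_s||, v fixed."""
    H = A.T @ A + (C.T @ C) / mu                  # constant left-hand side
    rhs0 = A.T @ b + C.T @ v
    y_bar, z, t = y0.copy(), y0.copy(), 1.0       # t = 1 is an assumed initialization
    for _ in range(iters):
        x = np.linalg.solve(H, rhs0 + C.T @ z / mu)          # Line 3: argmin_x f(x, z)
        d = C @ x - mu * v                                   # z - mu * grad_y f(x, z)
        y_new = np.zeros_like(d)
        for s in slices:                                     # Line 4: block soft-thresholding
            nrm = np.linalg.norm(d[s])
            if nrm > lam * mu:
                y_new[s] = d[s] * (1.0 - lam * mu / nrm)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # Line 5
        z = y_new + ((t - 1.0) / t_new) * (y_new - y_bar)    # Line 6: extrapolation
        y_bar, t = y_new, t_new
    return x, y_bar
```

Dropping the extrapolation (setting `z = y_new`) recovers ISTA-p, and the only linear system involved has the same left-hand side in every iteration.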
3.5 ISTA/FISTA: Full Linearization





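ISTA solves the following problem in each iteration to produce the next iterate $(\bar x^+, \bar y^+)$:

$$\min_{x', y'} \frac{1}{2\rho}\left\| \begin{pmatrix} x' \\ y' \end{pmatrix} - d \right\|^2 + \lambda \sum_s \|y'_s\| \equiv \frac{1}{2\rho}\|x' - d_x\|^2 + \sum_j \left( \frac{1}{2\rho}\|y'_j - d_{y_j}\|^2 + \lambda\|y'_j\| \right), \tag{26}$$

where $d = \begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} - \rho \nabla f(x, y)$, and $f(x, y)$ is defined in (11). It is easy to see that we can
solve for $\bar x^+$ and $\bar y^+$ separately in (26). Specifically,

$$\bar x^+ = d_x, \tag{27}$$

$$\bar y_j^+ = \frac{d_{y_j}}{\|d_{y_j}\|} \max(0, \|d_{y_j}\| - \lambda\rho), \quad j = 1, \ldots, J.$$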


Using ISTA to solve the outer augmented Lagrangian (5) subproblem is equivalent to taking only 
skipping steps in ALM-S. In our experiments, we used the accelerated version of ISTA, that is, 
FISTA (Algorithm 7) to solve (5). 


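Algorithm 7 FISTA
1: Given $\bar x^0$, $\bar y^0$, $v$. Choose $\rho_0$. Set $t_0 = 0$, $z_x^0 = \bar x^0$, $z_y^0 = \bar y^0$. Define $f(x, y)$ as in (11).
2: for $k = 0, 1, \ldots$ until stopping criterion is satisfied do
3:   Perform a backtracking line-search on $\rho$, starting from $\rho_0$.
4:   $\begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} z_x^k \\ z_y^k \end{pmatrix} - \rho \nabla f(z_x^k, z_y^k)$
5:   $\bar x^{k+1} \leftarrow d_x$
6:   $\bar y_j^{k+1} \leftarrow \frac{d_{y_j}}{\|d_{y_j}\|} \max(0, \|d_{y_j}\| - \lambda\rho)$, $j = 1, \ldots, J$.
7:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$
8:   $z_x^{k+1} \leftarrow \bar x^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar x^{k+1} - \bar x^k)$
9:   $z_y^{k+1} \leftarrow \bar y^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar y^{k+1} - \bar y^k)$
10: end for
11: return $(\bar x^{K+1}, \bar y^{K+1})$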





FISTA (resp. ISTA) is, in fact, an inexact version of FISTA-p (resp. ISTA-p), where we minimize
with respect to $x$ a linearized approximation

$$\tilde f(x) := f(x^k, z^k) + \nabla_x f(x^k, z^k)^T (x - x^k) + \frac{1}{2\rho}\|x - x^k\|^2$$

of the quadratic objective function $f(x, z^k)$ in (25). The update to $x$ in Line 3 of Algorithm 6 is
replaced by (27) as a result. Similar to FISTA-p, FISTA is also a special skipping version of the
full-split FALM-S. Considering that FISTA has an iteration complexity of $O(1/k^2)$, it is not surprising
that FISTA-p has the same iteration complexity.

Remark 4 Since FISTA requires only the gradient of $f(x, y)$, it can easily handle any smooth convex
loss function, such as the logistic loss for binary classification, $L(x) = \sum_{i=1}^{n} \log(1 + \exp(-b_i a_i^T x))$,
where $a_i^T$ is the $i$-th row of $A$, and $b$ is the vector of labels. Moreover, when the scale of the data
($\min\{n, m\}$) is so large that it is impractical to compute the Cholesky factorization of $A^T A$, FISTA
is a good choice to serve as the subroutine ApproxAugLagMin(x, y, v) in OGLasso-AugLag.


4. Overlapping Group $l_1/l_\infty$-Regularization


The subproblems with respect to $y$ (or $\bar y$) involved in all the algorithms presented in the previous
sections take the following form

$$\min_y \frac{1}{2\rho}\|c - y\|^2 + \tilde\Omega(y), \tag{28}$$

where $\tilde\Omega(y) = \lambda \sum_s w_s \|y_s\|_\infty$ in the case of $l_1/l_\infty$-regularization. In (7), for example, $c = Cx - \mu v$.
The solution to (28) is the proximal operator of $\tilde\Omega$ (Combettes and Wajs, 2006; Combettes and
Pesquet, 2011). Similar to the classical Group Lasso, this problem is block-separable and hence all
blocks can be solved simultaneously.

Again, for notational simplicity, we assume $w_s = 1$ for all $s \in S$ and omit it from now on. For each
$s \in S$, the subproblem in (28) is of the form

$$\min_{y_s} \frac{1}{2}\|c_s - y_s\|^2 + \rho\lambda\|y_s\|_\infty. \tag{29}$$


As shown by Wright et al. (2009), the optimal solution to the above problem is $c_s - P(c_s)$, where
$P$ denotes the orthogonal projector onto the ball of radius $\rho\lambda$ in the dual norm of the $l_\infty$-norm, that
is, the $l_1$-norm. The Euclidean projection onto the simplex can be computed in (expected) linear
time (Duchi et al., 2008; Brucker, 1984). Duchi et al. (2008) show that the problem of computing
the Euclidean projection onto the $l_1$-ball can be reduced to that of finding the Euclidean projection
onto the simplex in the following way. First, we replace $c_s$ in problem (29) by $|c_s|$, where the
absolute value is taken component-wise. After we obtain the projection $z_s$ onto the simplex, we
can construct the projection onto the $l_1$-ball by setting $y_s^* = \mathrm{sign}(c_s) z_s$, where $\mathrm{sign}(\cdot)$ is also taken
component-wise.
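The reduction described above can be sketched as follows. For brevity this illustration uses the sort-based $O(n \log n)$ simplex projection from Duchi et al. (2008), not the expected linear-time variant; the function names `project_l1_ball` and `prox_linf` are hypothetical and a positive radius is assumed.

```python
import numpy as np

def project_l1_ball(c, radius):
    """Euclidean projection of c onto {z : ||z||_1 <= radius}, with radius > 0."""
    a = np.abs(c)
    if a.sum() <= radius:
        return c.copy()                           # already inside the ball
    u = np.sort(a)[::-1]                          # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    idx = np.nonzero(u * k > css - radius)[0][-1] # last coordinate kept positive
    theta = (css[idx] - radius) / (idx + 1.0)     # shift that makes the sum equal radius
    z = np.maximum(a - theta, 0.0)                # projection of |c| onto the simplex
    return np.sign(c) * z                         # restore the signs

def prox_linf(c, thresh):
    """Solve min_y 0.5 * ||c - y||^2 + thresh * ||y||_inf as c - P(c)."""
    return c - project_l1_ball(c, thresh)
```

For the $l_1/l_\infty$ penalty, `prox_linf` would be applied to each block $c_s$ with `thresh` set to $\rho\lambda$, in the same way that block soft-thresholding is applied in the $l_1/l_2$ case.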


5. Experiments 


We tested the OGLasso-AugLag framework (Algorithm 2) with four subroutines: ADAL, FISTA,
FISTA-p, and APLM-S. We implemented the framework with the first three subroutines in C++ to
compare them with the ProxFlow algorithm proposed by Mairal et al. (2010). We used the C
interface and BLAS and LAPACK subroutines provided by the AMD Core Math Library (ACML).³ To
compare with ProxGrad (Chen et al., 2010), we implemented the framework and all four algorithms
in Matlab. We did not include ALM-S in our experiments because it is time-consuming to find the
right $\rho$ for the inner loops as discussed in Remark 2, and our preliminary computational experience
showed that ALM-S was slower than the other algorithms, even when the heuristic $\rho$-setting scheme
discussed in Remark 2 was used, because a large number of steps were skipping steps, which meant
that the computation involved in solving the linear systems in those steps was wasted. All of our
experiments were performed on a laptop PC with an Intel Core 2 Duo 2.0 GHz processor and 4 GB
of memory.

3. ACML can be found at http://developer.amd.com/libraries/acml/pages/default.aspx. Ideally, we should
have used the Intel Math Kernel Library (Intel MKL), which is optimized for Intel processors, but Intel MKL is not
freely available.

Table 1: Specification of the quantities used in the outer and inner stopping criteria.


5.1 Algorithm Parameters and Termination Criteria 


Each algorithm (framework + subroutine)⁴ required several parameters to be set and termination
criteria to be specified. We used stopping criteria based on the primal and dual residuals suggested
by Boyd et al. (2010). We specify the criteria for each of the algorithms below, but defer their
derivation to Appendix C. The maximum number of outer iterations was set to 500, and the tolerance
for the outer loop was set at $\epsilon_{out} = 10^{-4}$. The number of inner-iterations was capped at 2000, and
the tolerance at the $l$-th outer iteration for the inner loops was $\epsilon_{in}^l$. Our termination criterion for the
outer iterations was

$$\max\{r^l, s^l\} \le \epsilon_{out},$$

where $r^l = \frac{\|Cx^l - y^l\|}{\max\{\|Cx^l\|, \|y^l\|\}}$ is the outer relative primal residual and $s^l$ is the relative dual residual,
which is given for each algorithm in Table 1. Recall that $K + 1$ is the index of the last inner iteration
of the $l$-th outer iteration; for example, for APLM-S, $(x^{l+1}, y^{l+1})$ takes the value of the last inner
iterate $(x^{K+1}, \bar y^{K+1})$. We stopped the inner iterations when the maximum of the relative primal
residual and the relative objective gradient for the inner problem was less than $\epsilon_{in}^l$. (See Table 1 for
the expressions of these two quantities.) We see there that $s^{l+1}$ can be obtained directly from the
relative gradient residual computed in the last inner iteration of the $l$-th outer iteration.

We set $\mu_0 = 0.01$ in all algorithms except that we set $\mu_0 = 0.1$ in ADAL for the data sets other
than the first synthetic set and the breast cancer data set. We set $\rho = \mu$ in FISTA-p and APLM-S and
$\rho_0 = \mu$ in FISTA.

For Theorem 1 to hold, the solution returned by the function ApproxAugLagMin(x, y, v) has to
become increasingly more accurate over the outer iterations. However, it is not possible to evaluate
the sub-optimality quantity $\alpha^l$ in (6) exactly because the optimal value of the augmented Lagrangian
$\mathcal{L}(x, y, v^l)$ is not known in advance. In our experiments, we used the maximum of the relative primal





4. For conciseness, we use the subroutine names (e.g., FISTA-p) to represent the full algorithms that consist of the 
OGLasso-AugLag framework and the subroutines. 
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and dual residuals (max{r’,s'}) as a surrogate to a! for two reasons: First, it has been shown (Boyd 
et al., 2010) that r’ and s’ are closely related to œ’. Second, the quantities r’ and s’ are readily 
available as bi-products of the inner and outer iterations. To ensure that the sequence {e,,} satisfies 
(6), we basically set: 

in = Bin€ns (30) 


with e, = 0.01 and Bj, = 0.5. However, since we terminate the outer iterations at €,; > 0, it is 
not necessary to solve the subproblems to an accuracy much higher than the one for the outer loop. 
On the other hand, it is also important for el, to decrease to below £our, since s! is closely related 
to the quantities involved in the inner stopping criteria. Hence, we slightly modified (30) and used 
et = max{Bint!,, ,0.2€ out}. 
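The stopping rules of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the relative primal residual is taken to be the norm of Cx - y divided by the larger of the norms of Cx and y, and the inner tolerance is halved each outer iteration but floored at 0.2 times the outer tolerance.

```python
import numpy as np


def relative_primal_residual(Cx, y):
    """||Cx - y|| / max(||Cx||, ||y||); falls back to the absolute value at zero."""
    denom = max(np.linalg.norm(Cx), np.linalg.norm(y))
    diff = np.linalg.norm(Cx - y)
    return diff / denom if denom > 0 else diff


def inner_tolerances(eps_in0=0.01, beta_in=0.5, eps_out=1e-4, n_outer=12):
    """Inner tolerance per outer iteration: geometric decay with a floor."""
    tols = [eps_in0]
    for _ in range(n_outer - 1):
        tols.append(max(beta_in * tols[-1], 0.2 * eps_out))
    return tols


def outer_converged(r, s, eps_out=1e-4):
    """Outer loop stops once both relative residuals are within tolerance."""
    return max(r, s) <= eps_out
```

With the default values, the schedule reaches the floor of 0.2 times the outer tolerance after nine halvings and stays there, so the subproblems are never solved much more accurately than the outer loop requires.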

Recently, we became aware of an alternative ‘relative error’ stopping criterion (Eckstein and
Silva, 2012) for the inner loops, which guarantees convergence of Algorithm 2. In our context, this
criterion essentially requires that the absolute dual residual is less than a fraction of the absolute
primal residual. For FISTA-p, for instance, this condition requires that the $(l + 1)$-th iterate satisfies


1 yi+1 

yy 
where 7 and § are the numerators in the expressions for r and s respectively, © = 0.99, w? is a 
constant, and wy is an auxiliary variable updated in each outer iteration by wit 1 wi — zC T (5K+1_ 
z). We experimented with this criterion but did not find any computational advantage over the 
heuristic based on the relative primal and dual residuals. 





Ernie Onlin 141)2 
SH + soy, 
u 











5.2 Strategies for Updating u 


The penalty parameter $\mu$ in the outer augmented Lagrangian (5) not only controls the infeasibility in
the constraint $Cx = y$, but also serves as the step-length in the $y$-subproblem (and the $x$-subproblem
in the case of FISTA). We adopted two kinds of strategies for updating $\mu$. The first one simply kept
$\mu$ fixed. In this case, choosing an appropriate $\mu_0$ was important for good performance. This was
especially true for ADAL in our computational experiments. Usually, a $\mu_0$ in the range of $10^{-1}$ to
$10^{-3}$ worked well.

The second strategy is a dynamic scheme based on the values $r^l$ and $s^l$ (Boyd et al., 2010). Since
$1/\mu$ penalizes the primal infeasibility, a small $\mu$ tends to result in a small primal residual. On the other
hand, a large $\mu$ tends to yield a small dual residual. Hence, to keep $r^l$ and $s^l$ approximately balanced
in each outer iteration, our scheme updated $\mu$ as follows:

$$\mu^{l+1} = \begin{cases} \max\{\beta \mu^l, \mu_{\min}\}, & \text{if } r^l > \tau s^l, \\ \min\{\mu^l / \beta, \mu_{\max}\}, & \text{if } s^l > \tau r^l, \\ \mu^l, & \text{otherwise,} \end{cases}$$

where we set $\mu_{\max} = 10$, $\mu_{\min} = 10^{-6}$, $\tau = 10$ and $\beta = 0.5$, except for the first synthetic data set,
where we set $\beta = 0.1$ for ADAL, FISTA-p, and APLM-S.
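The dynamic scheme amounts to a three-way branch on the two residuals. The Python sketch below is a hypothetical helper (the name `update_mu` and its keyword arguments are assumptions), with defaults set to the values quoted above: lower bound 1e-6, upper bound 10, balance factor 10, and shrink factor 0.5.

```python
def update_mu(mu, r, s, beta=0.5, tau=10.0, mu_min=1e-6, mu_max=10.0):
    """Residual-balancing update of the penalty parameter.

    r is the relative primal residual and s the relative dual residual
    of the current outer iteration.
    """
    if r > tau * s:
        # Primal residual dominates: shrink mu, but not below mu_min.
        return max(beta * mu, mu_min)
    if s > tau * r:
        # Dual residual dominates: enlarge mu, but not above mu_max.
        return min(mu / beta, mu_max)
    # Residuals are roughly balanced: leave mu unchanged.
    return mu
```

Because a smaller penalty parameter weights the constraint violation more heavily, the first branch pushes the primal residual down, while the second branch does the same for the dual residual.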


5.3 Synthetic Examples 


To compare our algorithms with the ProxGrad algorithm of Chen et al. (2010), we first tested a
synthetic data set (ogl) using the procedure reported by Chen et al. (2010) and Jacob et al. (2009).
The decision variables $x$ were arranged in groups of ten, with adjacent groups having
an overlap of three variables. The support of $x$ was set to the first half of the variables. Each entry
in the design matrix $A$ and the non-zero entries of $x$ were sampled from i.i.d. standard Gaussian
distributions, and the output $b$ was set to $b = Ax + \epsilon$, where the noise $\epsilon \sim N(0, I)$. Two sets of data
were generated as follows: (a) Fix $n = 5000$ and vary the number of groups $J$ from 100 to 1000 with
increments of 100. (b) Fix $J = 200$ and vary $n$ from 1000 to 10000 with increments of 1000. The
stopping criterion for ProxGrad was the same as the one used for FISTA, and we set its smoothing 
parameter to 107°. Figure 1 plots the CPU times taken by the Matlab version of our algorithms 
and ProxGrad (also in Matlab) on these scalability tests with $\ell_1/\ell_2$-regularization. A subset of the
numerical results on which these plots are based is presented in Tables 4 and 5.
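For readers who want to reproduce a data set of this shape, the following Python sketch generates one instance. It is an illustration, not the authors' generator: the function name, the seed handling, and the reading of n as the number of samples (rows of A) are assumptions. With J groups of size ten and an overlap of three, the number of features is 7J + 3.

```python
import numpy as np


def make_ogl_data(n, J, group_size=10, overlap=3, seed=0):
    """Synthetic overlapping-group data: b = A x + noise, chained groups."""
    rng = np.random.default_rng(seed)
    stride = group_size - overlap
    m = J * stride + overlap                      # total number of features
    groups = [np.arange(j * stride, j * stride + group_size) for j in range(J)]

    x = np.zeros(m)
    support = m // 2                              # first half of the variables
    x[:support] = rng.standard_normal(support)

    A = rng.standard_normal((n, m))               # i.i.d. standard Gaussian design
    b = A @ x + rng.standard_normal(n)            # unit-variance Gaussian noise
    return A, b, x, groups
```

For example, `make_ogl_data(5000, 100)` would correspond in shape to the data set labelled ogl-5000-100-10-3 in the tables.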

The plots clearly show that the alternating direction methods were much faster than ProxGrad
on these two data sets. FISTA-p performed slightly better than ADAL and showed an
obvious computational advantage over its general version, APLM-S. In the plot on the left of
Figure 1, FISTA exhibited the advantage of a gradient-based algorithm when both $n$ and $m$ are large. In that
case (towards the right end of the plot), the Cholesky factorizations required by ADAL, APLM-S,
and FISTA-p became relatively expensive. When $\min\{n, m\}$ is small or the linear systems can be
solved cheaply, as the plot on the right shows, FISTA-p and ADAL have an edge over FISTA due to
the smaller numbers of inner iterations required.

We generated a second data set (dct) using the approach of Mairal et al. (2010) for scalability
tests on both the $\ell_1/\ell_2$ and $\ell_1/\ell_\infty$ group penalties. The design matrix $A$ was formed from
overcomplete dictionaries of discrete cosine transforms (DCT). The set of groups consisted of all the contiguous
sequences of length five in one-dimensional space. $x$ had about 10% non-zero entries, selected
randomly. We generated the output as $b = Ax + \epsilon$, where $\epsilon \sim N(0, 0.01\|Ax\|^2)$. We fixed $n = 1000$
and varied the number of features $m$ from 5000 to 30000 with increments of 5000. This set of
data leads to considerably harder problems than the previous set because the groups are heavily
overlapping, and the DCT dictionary-based design matrix exhibits local correlations. Due to the
excessive running time required in Matlab, we ran the C++ version of our algorithms for this data
set, leaving out APLM-S and ProxGrad, whose performance compared to the other algorithms is
already fairly clear from Figure 1. For ProxFlow, we set the tolerance on the relative duality gap to
$10^{-4}$, the same as $\epsilon_{out}$, and kept all the other parameters at their default values.
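A rough Python sketch of a data set with this structure follows. Several details are assumptions rather than the construction of Mairal et al. (2010): the particular cosine formula used for the overcomplete dictionary, the column normalization, the reading of the noise variance as a per-entry variance, and all function names.

```python
import numpy as np


def contiguous_groups(m, length=5):
    """All contiguous index windows of the given length in {0, ..., m-1}."""
    return [np.arange(j, j + length) for j in range(m - length + 1)]


def make_dct_data(n, m, density=0.1, seed=0):
    """Overcomplete cosine dictionary (assumed form) with a sparse coefficient vector."""
    rng = np.random.default_rng(seed)
    t = (np.arange(n) + 0.5)[:, None]             # sample positions
    k = np.arange(m)[None, :]                     # atom frequencies
    A = np.cos(np.pi * t * k / m)                 # n x m with m > n: overcomplete
    A /= np.linalg.norm(A, axis=0)                # unit-norm columns

    x = np.zeros(m)
    idx = rng.choice(m, size=max(1, int(density * m)), replace=False)
    x[idx] = rng.standard_normal(idx.size)        # about 10% non-zero entries

    Ax = A @ x
    sigma = 0.1 * np.linalg.norm(Ax)              # std for variance 0.01 * ||Ax||^2
    b = Ax + sigma * rng.standard_normal(n)
    return A, b, x
```

Every interior feature belongs to five windows, which is what makes these groups heavily overlapping compared with the chained groups of the first data set.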

Figure 2 presents the CPU times required by the algorithms versus the number of features. In
the case of $\ell_1/\ell_2$-regularization, it is clear that FISTA-p outperformed the other two algorithms.
For $\ell_1/\ell_\infty$-regularization, ADAL and FISTA-p performed equally well and compared favorably to
ProxFlow. In both cases, the growth of the CPU times for FISTA follows the same trend as that
for FISTA-p, and they required a similar number of outer iterations, as shown in Tables 6 and 7.
However, FISTA lagged behind in speed due to larger numbers of inner iterations. Unlike in the
case of the ogl data set, Cholesky factorization was not a bottleneck for FISTA-p and ADAL here
because we needed to compute it only once.

To simulate the situation where computing or caching $A^T A$ and its Cholesky factorization is not
feasible, we switched ADAL and FISTA-p to PCG mode by always using PCG to solve the linear
systems in the subproblems. We compared the performance of ADAL, FISTA-p, and FISTA on
the previous data set for both the $\ell_1/\ell_2$ and $\ell_1/\ell_\infty$ models. The results for ProxFlow are copied
from Figure 2 and Table 9 to serve as a reference. We experimented with the fixed-value and
the dynamic updating schemes for $\mu$ on all three algorithms. From Figure 3, it is clear that the
performance of FISTA-p was significantly improved by using the dynamic scheme. For ADAL,
however, the dynamic scheme worked well only in the $\ell_1/\ell_2$ case, whereas the performance turned
worse in general in the $\ell_1/\ell_\infty$ case. We did not include the results for FISTA with the dynamic
scheme because the solutions obtained were considerably more suboptimal than the ones obtained
with the fixed-$\mu$ scheme. Tables 8 and 9 report the best results of the algorithms in each case. The
plots and numerical results show that FISTA-p compares favorably to ADAL and remains competitive
with ProxFlow. In terms of the quality of the solutions, FISTA-p and ADAL also did a better job than
FISTA, as evidenced in Table 9. On the other hand, the gap in CPU time between FISTA and the
other three algorithms is less obvious.

Figure 1: Scalability test results of the algorithms on the synthetic overlapping Group Lasso data
sets from Chen et al. (2010). The scale of the y-axis is logarithmic. The dynamic scheme
for $\mu$ was used for all algorithms except ProxGrad.


5.4 Real-world Examples 


To demonstrate the practical usefulness of our algorithms, we tested our algorithms on two real- 
world applications. 


5.4.1 BREAST CANCER GENE EXPRESSIONS 


We used the breast cancer data set (Van De Vijver et al., 2002) with canonical pathways from
MSigDB (Subramanian et al., 2005). The data were collected from 295 breast cancer tumor samples
and contain gene expression measurements for 8,141 genes. The goal was to select a small set
of the most relevant genes that yield the best prediction performance. A detailed description of
the data set can be found in Chen et al. (2010) and Jacob et al. (2009). In our experiment, we
performed a regression task to predict the length of survival of the patients. The canonical pathways
naturally provide grouping information of the genes. Hence, we used them as the groups for the
group-structured regularization term $\Omega(\cdot)$.
Figure 2: Scalability test results on the DCT set with $\ell_1/\ell_2$-regularization (left column) and
$\ell_1/\ell_\infty$-regularization (right column). The scale of the y-axis is logarithmic. FISTA-p,
FISTA, and ADAL were all run with a fixed $\mu = \mu_0$.

Figure 3: Scalability test results on the DCT set with $\ell_1/\ell_2$-regularization (left column) and
$\ell_1/\ell_\infty$-regularization (right column). The scale of the y-axis is logarithmic. FISTA-p and ADAL
are in PCG mode. The dotted lines denote the results obtained with the dynamic updating
scheme for $\mu$.
Data sets | N (no. samples) | J (no. groups) | group size | average frequency
BreastCancerData | 295 | 637 | 23.7 (avg) | 4

Table 2: The Breast Cancer Data Set
Figure 4: On the left: Plot of root-mean-squared-error against the number of active genes for the
Breast Cancer data. The plot is based on the regularization path for ten different values of
$\lambda$. The total CPU time (in Matlab) using FISTA-p was 51 seconds for $\ell_1/\ell_2$-regularization
and 115 seconds for $\ell_1/\ell_\infty$-regularization. On the right: The recovered sparse gene
coefficients for predicting the length of the survival period. The value of $\lambda$ used here was the
one minimizing the RMSE in the plot on the left.


Table 2 summarizes the data attributes. The numerical results for the $\ell_1/\ell_2$-norm are collected
in Table 10, which shows that FISTA-p and ADAL were the fastest on this data set. Again, we had
to tune ADAL with different initial values ($\mu_0$) and updating schemes of $\mu$ for speed and quality of
the solution, and we eventually kept $\mu$ constant at 0.01. The dynamic updating scheme for $\mu$ also
did not work for FISTA, which returned a very suboptimal solution in this case. We instead adopted
a simple scheme of decreasing $\mu$ by half every 10 outer iterations. Figure 6 graphically depicts
the performance of the different algorithms. In terms of the outer iterations, APLM-S behaved
identically to FISTA-p, and FISTA also behaved similarly to ADAL. However, APLM-S and FISTA
were considerably slower due to larger numbers of inner iterations.


We plot the root-mean-squared-error (RMSE) over different values of $\lambda$ (which lead to different
numbers of active genes) in the left half of Figure 4. The training set consists of 200 randomly
selected samples, and the RMSE was computed on the remaining 95 samples. $\ell_1/\ell_2$-regularization
achieved lower RMSE in this case. However, $\ell_1/\ell_\infty$-regularization yielded better group sparsity as
shown in Figure 5. The sets of active genes selected by the two models were very similar as
illustrated in the right half of Figure 4. In general, the magnitudes of the coefficients returned by
$\ell_1/\ell_\infty$-regularization tended to be similar within a group, whereas those returned by $\ell_1/\ell_2$-regularization
did not follow that pattern. This is because $\ell_1/\ell_\infty$-regularization penalizes only the maximum
element, rather than all the coefficients in a group, resulting in many coefficients having the same
magnitudes.
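The difference between the two penalties can be seen on a toy vector. The Python sketch below uses hypothetical names and assumes unit group weights, which the paper's penalty need not have. It evaluates both group norms and shows that raising a non-maximal coefficient leaves the max-norm penalty unchanged while increasing the Euclidean-norm penalty.

```python
import numpy as np


def group_penalty(x, groups, norm="l2"):
    """Sum over groups of the l2 or l-infinity norm of the sub-vector."""
    order = 2 if norm == "l2" else np.inf
    return sum(np.linalg.norm(x[g], ord=order) for g in groups)


groups = [np.array([0, 1, 2]), np.array([2, 3, 4])]   # overlap at index 2
x = np.array([3.0, 1.0, 0.5, 0.0, 0.0])
x_raised = x.copy()
x_raised[1] = 2.0                                     # still below the group maximum

linf_before = group_penalty(x, groups, "linf")
linf_after = group_penalty(x_raised, groups, "linf")
l2_before = group_penalty(x, groups, "l2")
l2_after = group_penalty(x_raised, groups, "l2")
```

Since the max-norm penalty gives no incentive to shrink coefficients below the group maximum, minimizers tend to have several coefficients tied at that maximum, which is consistent with the pattern described above.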
Figure 5: Pathway-level sparsity vs. gene-level sparsity.

Figure 6: Objective values vs. outer iterations and objective values vs. CPU time for the Breast
Cancer data. The results for ProxGrad are not plotted due to the different objective
function that it minimizes. The red (APLM-S) and blue (FISTA-p) lines overlap in the left
column.


5.4.2 VIDEO SEQUENCE BACKGROUND SUBTRACTION 


We next considered the video sequence background subtraction task from Mairal et al. (2010) and
Huang et al. (2009). The main objective here is to segment out foreground objects in an image
(frame), given a sequence of $m$ frames from a fixed camera. The data used in this experiment
is available online⁵ (Toyama et al., 1999). The basic setup of the problem is as follows. We
represent each frame of $n$ pixels as a column vector $A_i \in R^n$ and form the matrix $A \in R^{n \times m}$ as
$A = (A_1\ A_2\ \cdots\ A_m)$. The test frame is represented by $b \in R^n$. We model the relationship
between $b$ and $A$ by $b \approx Ax + e$, where $x$ is assumed to be sparse, and $e$ is the 'noise' term, which is
also assumed to be sparse. $Ax$ is thus a sparse linear combination of the video frame sequence and
accounts for the background present in both $A$ and $b$. $e$ contains the sparse foreground objects in $b$.
The basic model with $\ell_1$-regularization (Lasso) is

$$\min_{x, e}\ \frac{1}{2}\|Ax + e - b\|^2 + \lambda(\|x\|_1 + \|e\|_1). \qquad (31)$$

5. Data can be found at http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm.

It has been shown in Mairal et al. (2010) that we can significantly improve the quality of
segmentation by applying a group-structured regularization $\Omega(\cdot)$ on $e$, where the groups are all the
overlapping $k \times k$ square patches in the image. Here, we set $k = 3$. The model thus becomes

$$\min_{x, e}\ \frac{1}{2}\|Ax + e - b\|^2 + \lambda(\|x\|_1 + \|e\|_1 + \Omega(e)). \qquad (32)$$

Note that (32) still fits into the group-sparse framework if we treat the $\ell_1$-regularization terms as
sums of group norms, where each group consists of only one element.
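To make the group structure concrete, the Python sketch below enumerates the index sets of all overlapping square patches in an image. The function name and the row-major flattening of pixels are assumptions; the paper does not state how pixels are ordered in the vector e.

```python
import numpy as np


def patch_groups(height, width, k=3):
    """Index sets of all overlapping k-by-k patches (row-major pixel order)."""
    idx = np.arange(height * width).reshape(height, width)
    groups = []
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            groups.append(idx[i:i + k, j:j + k].ravel())
    return groups
```

An image of height h and width w yields (h - k + 1)(w - k + 1) groups, so an interior pixel belongs to nine patches when k = 3.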

We also considered an alternative model, where a Ridge regularization is applied to $x$ and an
Elastic-Net penalty (Zou and Hastie, 2005) to $e$. This model,

$$\min_{x, e}\ \frac{1}{2}\|Ax + e - b\|^2 + \lambda_1 \|e\|_1 + \lambda_2(\|x\|^2 + \|e\|^2), \qquad (33)$$

does not yield a sparse $x$, but sparsity in $x$ is not a crucial factor here. It is, however, well suited for
our partial linearization methods (APLM-S and FISTA-p), since there is no need for the augmented
Lagrangian framework. Of course, we can also apply FISTA to solve (33).

We recovered the foreground objects by solving the above optimization problems and applying
the sparsity pattern of $e$ as a mask for the original test frame. A hand-segmented evaluation image
from Toyama et al. (1999) served as the ground truth. The regularization parameters $\lambda$, $\lambda_1$, and $\lambda_2$
were selected in such a way that the recovered foreground objects matched the ground truth to the
maximum extent.

FISTA-p was used to solve all three models. The $\ell_1$ model (31) was treated as a special case of
the group regularization model (32), with each group containing only one component of the feature
vector. For the Ridge/Elastic-Net penalty model, we applied FISTA-p directly without the outer
augmented Lagrangian layer.

The solutions for the $\ell_1/\ell_2$, $\ell_1/\ell_\infty$, and Lasso models were not strictly sparse in the sense that
those supposedly zero feature coefficients had non-zero (albeit extremely small) magnitudes, since
we enforced the linear constraints $Cx = y$ through an augmented Lagrangian approach. To obtain
sparse solutions, we truncated the non-sparse solutions using thresholds ranging from 10~° to 1073 
and selected the threshold that yielded the best accuracy. 
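The truncation step can be sketched as a small grid search in Python. The helper names and the threshold grid in the usage note are hypothetical, and accuracy is taken here to be the fraction of pixels whose predicted foreground label matches the ground-truth mask.

```python
import numpy as np


def mask_accuracy(e, truth_mask, threshold):
    """Fraction of pixels where (|e| > threshold) agrees with the ground truth."""
    pred = np.abs(e) > threshold
    return np.mean(pred == truth_mask)


def best_threshold(e, truth_mask, thresholds):
    """Return the threshold with the highest accuracy and that accuracy."""
    scores = [mask_accuracy(e, truth_mask, t) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]
```

For instance, `best_threshold(e, truth, [1e-5, 1e-4, 1e-3])` would try three illustrative candidates; ties are resolved in favor of the first threshold in the list.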

6. We did not use the original version of FISTA to solve the model as an $\ell_1$-regularization problem because it took too
long to converge in our experiments due to extremely small step sizes.

[Figure 7 panel labels: test image $b$; background $Ax$; Ridge + ElasticNet; accuracy = 97.17%, 98.18%, 87.63%, 87.69%]

Figure 7: Separation results for the video sequence background subtraction example. Each training
image had 120 x 160 RGB pixels. The training set contained 200 images in sequence.
The accuracy indicated for each of the different models is the percentage of pixels that
matched the ground truth.

Note that because of the additional feature vector $e$, the data matrix is effectively $\tilde{A} = (A\ \ I_n) \in R^{n \times (m+n)}$.
For solving (32), FISTA-p has to solve the linear system

$$\begin{pmatrix} A^T A + \frac{1}{\mu} D_x & A^T \\ A & I_n + \frac{1}{\mu} D_e \end{pmatrix} \begin{pmatrix} x \\ e \end{pmatrix} = \begin{pmatrix} r_x \\ r_e \end{pmatrix},$$

where $D$ is a diagonal matrix, and $D_x$, $D_e$, $r_x$, $r_e$ are the components of $D$ and $r$ corresponding to $x$
and $e$ respectively. In this example, $n$ is much larger than $m$, for example, $n = 57600$, $m = 200$. To
avoid solving a system of size $n \times n$, we took the Schur complement of $I_n + \frac{1}{\mu} D_e$ and solved instead
the positive definite $m \times m$ system

$$\left(A^T A + \frac{1}{\mu} D_x - A^T \left(I_n + \frac{1}{\mu} D_e\right)^{-1} A\right) x = r_x - A^T \left(I_n + \frac{1}{\mu} D_e\right)^{-1} r_e,$$

$$e = \mathrm{diag}\left(1 + \frac{1}{\mu} D_e\right)^{-1} (r_e - Ax).$$
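The Schur-complement reduction can be checked numerically with a short Python sketch. Everything here is illustrative: the function name is hypothetical, the diagonal matrices are passed as vectors `d_x` and `d_e`, and a generic dense solver stands in for whatever factorization the authors used. Because the lower-right block is diagonal, its inverse is an element-wise reciprocal, so only a small system with one row per column of A has to be solved.

```python
import numpy as np


def solve_via_schur(A, d_x, d_e, r_x, r_e, mu):
    """Solve the 2x2 block system for (x, e) by eliminating e.

    Block system:
        [A^T A + D_x/mu    A^T        ] [x]   [r_x]
        [A                 I + D_e/mu ] [e] = [r_e]
    """
    w = 1.0 / (1.0 + d_e / mu)                        # inverse of the diagonal block
    S = A.T @ A + np.diag(d_x / mu) - A.T @ (w[:, None] * A)
    x = np.linalg.solve(S, r_x - A.T @ (w * r_e))     # reduced m x m system
    e = w * (r_e - A @ x)                             # back-substitution
    return x, e
```

When the entries of `d_x` and `d_e` are positive, the reduced matrix is symmetric positive definite, so a Cholesky factorization could replace the generic solve.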


The $\ell_1/\ell_\infty$ model yielded the best background separation accuracy (marginally better than the
$\ell_1/\ell_2$ model), but it was also the most computationally expensive. (See Table 3 and Figure 7.)
Although the Ridge/Elastic-Net model yielded separation results as poor as those of the Lasso ($\ell_1$) model, it
was orders of magnitude faster to solve using FISTA-p. We again observed that the dynamic scheme
for $\mu$ worked better for FISTA-p than for ADAL. For a constant $\mu$ over the entire run, ADAL took at
least twice as long as FISTA-p to produce a solution of the same quality. A typical run of FISTA-p
on this problem with the best selected $\lambda$ took fewer than 10 outer iterations. On the other hand, ADAL
took more than 500 iterations to meet the stopping criteria.


5.5 Comments on Results 


The computational results exhibit two general patterns. First, the simpler algorithms (FISTA-p and
ADAL) were significantly faster than the more general algorithms, such as APLM-S. Interestingly,
the majority of the APLM-S inner iterations consisted of a skipping step for the tests on synthetic
data and the breast cancer data, which means that APLM-S essentially behaved like ISTA-p in
these cases. Indeed, FISTA-p generally required the same number of outer iterations as APLM-S
but far fewer inner iterations, as predicted by theory. In addition, no computational steps
were wasted and no function evaluations were required for FISTA-p and ADAL. Second, FISTA-p
converged faster (required fewer iterations) than its full-linearization counterpart FISTA. We have
suggested possible reasons for this in Section 3. On the other hand, FISTA was very effective
for data both of whose dimensions were large because it required only gradient computations and
soft-thresholding operations, and did not require linear systems to be solved.

Model | Accuracy (percent) | Total CPU time (s) | No. parameter values on reg path
$\ell_1/\ell_2$ | 97.17 | 2.48e+003 | 8
$\ell_1/\ell_\infty$ | 98.18 | 4.07e+003 | 6
$\ell_1$ | 87.63 | 1.61e+003 | 11
ridge + elastic net | 87.89 | 1.82e+002 | 64

Table 3: Computational results for the video sequence background subtraction example. The
algorithm used is FISTA-p. We used the Matlab version for the ease of generating the images.
The C++ version runs at least four times faster from our experience in the previous
experiments. We report the best accuracy found on the regularization path of each model. The
total CPU time is recorded for computing the entire regularization path, with the specified
number of different regularization parameter values.

Our experiments showed that the performance of ADAL (as well as the quality of the solution 
that it returned) varied a lot as a function of the parameter settings, and it was tricky to tune them 
optimally. In contrast, FISTA-p exhibited fairly stable performance for a simple set of parameters 
that we rarely had to alter and in general performed better than ADAL. 

It may seem straightforward to apply FISTA directly to the Lasso problem (31) without the
augmented Lagrangian framework.⁷ However, as we have seen in our experiments, FISTA took
much longer than AugLag-FISTA-p to solve this problem. We believe that this is further evidence
of the ‘load-balancing’ property of the latter algorithm that we discussed in Section 3.2. It also
demonstrates the versatility of our approach to regularized learning problems.


6. Conclusion 


We have built a unified framework for solving sparse learning problems involving group-structured
regularization, in particular, the $\ell_1/\ell_2$- or $\ell_1/\ell_\infty$-regularization of arbitrarily overlapping groups of
variables. For the key building-block of this framework, we developed new efficient algorithms
based on alternating partial-linearization/splitting, with proven convergence rates. In addition, we
have also incorporated ADAL and FISTA into our framework. Computational tests on several sets of
synthetic test data demonstrated the relative strength of the algorithms, and through two real-world
applications we compared the relative merits of these structured sparsity-inducing norms. Among
the algorithms studied, FISTA-p and ADAL performed the best on most of the data sets, and FISTA
appeared to be a good alternative choice for large-scale data. From our experience, FISTA-p is
easier to configure and is more robust to variations in the algorithm parameters. Together, they form
a flexible and versatile suite of methods for group-sparse problems of different sizes.

7. To avoid confusion with our algorithms that consist of inner-outer iterations, we prefix our algorithms with ‘AugLag’
here.
Appendix A. Proof of Lemma 1


F(x,y) —F(x,q) > F (x,9) — Ly (x,y, 4, Vyf (x,y) ) 


= FES- (Foy) YF) @-y) + sIM-sIP+8@)- 4 





From the optimality of g, we also have 
K+S) + 59-3) =0. (35) 
Since F (x,y) = f(x,y) +g(y), and f and g are convex functions, for any (x,y), 
F(E) 2 8@) + 9-9) ¥e(@) + f(y) + F-y) VS y) HE VS ay) BO) 
Therefore, from (34), (35), and (36), it follows that 


F(x,9)—F(x,g) > 9(@)+0-4'%(4) +f (%y¥) + G—-y) Vy f(a) 
+- x)" V, f(x,y) 


- (FEDVE -+ la-la) 








= (5-8) (3) + Vy) sella yl? + x)" Vif Ey) 





= 6-9" (-ha-») - Fla + ay VF oy) 





= gli- -I+ E-a VF.) 


The proof for the second part of the lemma is very similar, but we give it for completeness. 


Fæ) -F(a > Flsy)— (ADO a-la) 6D 

By the optimality of (p,q), we have 
Vif(p.q) = 9, (38) 
VAPNO) = 0 (39) 
Since F (x,y) = f(x,y) + g(y), it follows from the convexity of both f and g and (38) that


F (x,y) > 89) + 0-7) YO) +f) + (9-9) Vy f (p,a). (40) 


Now combining (37), (39), and (40), it follows that 
= 1 3 
F(x,y)—F(p.q) > 0-4)" (Ye(¥)+Vyf(P,9)) - ap ll4 —5\|7 


= 9-0" (16-0) - sols 








— 1 2 =1)2 
= 3p (lla yll — lly — Fl). 


Appendix B. Proof of Theorem 2 


Let I be the set of all regular iteration indices among the first k — 1 iterations, and let Z, be its 
complement. For all n € I, y"+! = J". 


For n € I, we can apply Lemma 1 since (18) automatically holds, and (16) holds when p < + 


L(f)* 
In (19), by letting (x,y) = (x*, y*), and ¥ = 7”, we get (p,q) = (x"*!y"*"), and 
2p(F(x*,y*) — F(T}, y+) > y —y* |? — ly" -—y" |. (41) 


In (17), by letting (*,¥) = (x*,y*), (x,y) = (ttl yrtly, we get g = rel ana 





o eA a lb yl? 
He SP Vey) 
= pea pro ks (42) 


since V,f(x"t!,y"*!) = 0, for n € I by (38) and for n € Ie by (15). Adding (42) to (41), we get 














20OF Ey SER EEO yy) ye ara Fle (43) 
For n € Ie, since V,f(x"*!, y"*!) = 0, we have that (42) holds. Since y+! = 9", it follows that 
2G VaR ay yey Sra aes (44) 


Summing (43) and (44) over n = 0,1,...,k — 1 and observing that 2|/| + |Ze| = k + kn, we obtain 


k-1 
2p (erre — F(x" y") — gromy) (45) 
n=1 nel 
= 1 2 2 
> Yer -y i -i-l 
n=0 


= |y*-y*?-|-y* I? 
> | -y| 


In Lemma 1, by letting (¥,9) = (x”*!,y"*!) in (17) instead of (x*,y*), we have from (42) that 


KG ay) — F(x”t! +1) > yt ay" |? > 0. (46) 
Similarly, for n € Z, if we let (x,y) = (x”,y”) instead of (x*,y*) in (41), we have


2p(F (x, 9") — F(t y"t!)) > y- FI? > o. (47) 


Forn € le, y"*! = y"; from (15), since x"*! = argmin, F (x,y) with y = y" =y""!, 


2p(F (x",9") —F(x"t!y"t")) > 0. (48) 


Hence, from (46) and (47) to (48), F(x", y") > F(x",9") > F(x"! y""!) > F(x"! y""!). Then, we 
have 
k-1 
Vere ye) Sk Ot sand: VFO yy) > er oy): (49) 
n=0 nel 


Combining (45) and (49) yields 2p(k + kn) (F (x*,y*) — F (x*,5*)) > —||y° —y* ||. 


Appendix C. Derivation of the Stopping Criteria 


In this section, we show that the quantities that we use in our stopping criteria correspond to the 
primal and dual residuals (Boyd et al., 2010) for the outer iterations and the gradient residuals for 
the inner iterations. We first consider the inner iterations. 


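FISTA-p The necessary and sufficient optimality conditions for problem (10) or (13) are primal
feasibility

$$\bar{y}^* - y^* = 0, \qquad (50)$$

and vanishing of the gradient of the objective function at $(x^*, \bar{y}^*)$, that is,

$$0 = \nabla_x f(x^*, \bar{y}^*), \qquad (51)$$
$$0 \in \nabla_y f(x^*, \bar{y}^*) + \partial g(\bar{y}^*). \qquad (52)$$

Since $y^{k+1} = z^k$, the primal residual is thus $\bar{y}^{k+1} - y^{k+1} = \bar{y}^{k+1} - z^k$. It follows from the
optimality of $x^{k+1}$ in Line 3 of Algorithm 6 that

$$A^T(Ax^{k+1} - b) - C^T v^l + \frac{1}{\mu} C^T (Cx^{k+1} - \bar{y}^{k+1}) + \frac{1}{\mu} C^T (\bar{y}^{k+1} - z^k) = 0$$
$$\Rightarrow\ \nabla_x f(x^{k+1}, \bar{y}^{k+1}) = \frac{1}{\mu} C^T (z^k - \bar{y}^{k+1}).$$

Similarly, from the optimality of $\bar{y}^{k+1}$ in Line 4, we have that

$$0 \in \partial g(\bar{y}^{k+1}) + \nabla_y f(x^{k+1}, z^k) + \frac{1}{\rho}(\bar{y}^{k+1} - z^k)$$
$$= \partial g(\bar{y}^{k+1}) + \nabla_y f(x^{k+1}, \bar{y}^{k+1}) - \frac{1}{\mu}(\bar{y}^{k+1} - z^k) + \frac{1}{\rho}(\bar{y}^{k+1} - z^k)$$
$$= \partial g(\bar{y}^{k+1}) + \nabla_y f(x^{k+1}, \bar{y}^{k+1}),$$

where the last step follows from $\mu = \rho$. Hence, we see that $\frac{1}{\mu} C^T (z^k - \bar{y}^{k+1})$ is the gradient
residual corresponding to (51), while (52) is satisfied in every inner iteration.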
APLM-S The primal residual is $\bar{y}^{k+1} - y^{k+1}$ from (50). Following the derivation for FISTA-p, it is
not hard to verify that (52) is always satisfied, and the gradient residual corresponding to (51)


is oe =y 


FISTA Similar to FISTA-p, the necessary and sufficient optimality conditions for problem (22) are 
primal feasibility 


y“) =O), 
and vanishing of the objective gradient at (x*,¥*), 
0 = Vif (9°), 
0 € Vyf) +80"). 


Clearly, the primal residual is (+! — zk, y! — z4) since (x*+1 y+!) = (zk, z%). From the 
optimality of (+1, 5**), it follows that 


1 
0 = Vif(& z) + 5h es 
1 
0 € BODES H 


Here, we simply use nG etl _ 2k) and 1 ay yet! — z) to approximate the gradient residuals. 


Next, we consider the outer iterations. The necessary and sufficient optimality conditions for
problem (4) are primal feasibility

$$Cx^* - y^* = 0,$$

and dual feasibility

$$0 = \nabla L(x^*) - C^T v^*,$$
$$0 \in \partial\Omega(y^*) + v^*.$$

Clearly, the primal residual is $r^l = Cx^l - y^l$. The dual residual is

$$\begin{pmatrix} \nabla L(x^{l+1}) - C^T \left(v^l - \frac{1}{\mu}(Cx^{l+1} - \bar{y}^{l+1})\right) \\ \partial\Omega(\bar{y}^{l+1}) + v^l - \frac{1}{\mu}(Cx^{l+1} - \bar{y}^{l+1}) \end{pmatrix},$$

recalling that $v^{l+1} = v^l - \frac{1}{\mu}(Cx^{l+1} - \bar{y}^{l+1})$. The above
is simply the gradient of the augmented Lagrangian (5) evaluated at $(x^l, y^l, v^l)$. Now, since the
objective function of an inner iteration is the augmented Lagrangian with $v = v^l$, the dual residual
for an outer iteration is readily available from the gradient residual computed for the last inner
iteration of the outer iteration.


Appendix D. Numerical Results 
See Tables 4 to 10. 
Data Sets | Algs | CPU (s) | Iters | Avg Sub-iters | F(x)
ogl-5000-100-10-3 | ADAL | 1.70e+000 | 61 | 1.00e+000 | 1.9482e+005
 | APLM-S | 1.71e+000 | 8 | 4.88e+000 | 1.9482e+005
 | FISTA-p | 9.08e-001 | 8 | 4.38e+000 | 1.9482e+005
 | FISTA | 2.74e+000 | 10 | 7.30e+000 | 1.9482e+005
 | ProxGrad | 7.92e+001 | 3858 | - | -
ogl-5000-600-10-3 | ADAL | 6.75e+001 | 105 | 1.00e+000 | 1.4603e+006
 | APLM-S | 1.79e+002 | 9 | 1.74e+001 | 1.4603e+006
 | FISTA-p | 4.77e+001 | 9 | 8.56e+000 | 1.4603e+006
 | FISTA | 3.28e+001 | 12 | 1.36e+001 | 1.4603e+006
 | ProxGrad | 7.96e+002 | 5608 | - | -
ogl-5000-1000-10-3 | ADAL | 2.83e+002 | 151 | 1.00e+000 | 2.6746e+006
 | APLM-S | 8.06e+002 | 10 | 2.76e+001 | 2.6746e+006
 | FISTA-p | 2.49e+002 | 10 | 1.28e+001 | 2.6746e+006
 | FISTA | 5.21e+001 | 13 | 1.55e+001 | 2.6746e+006
 | ProxGrad | 1.64e+003 | 6471 | - | -

Table 4: Numerical results for ogl set 1. For ProxGrad, the Avg Sub-iters and F(x) fields are not
applicable since the algorithm is not based on an outer-inner iteration scheme, and the
objective function that it minimizes is different from ours. We tested ten problems with
$J = 100, \ldots, 1000$, but only show the results for three of them to save space.








Data Sets          | Algs     | CPU (s)   | Iters | Avg Sub-iters | F(x)
-------------------+----------+-----------+-------+---------------+------------
ogl-1000-200-10-3  | ADAL     | 4.18e+000 |    77 | 1.00e+000     | 9.6155e+004
                   | APLM-S   | 1.64e+001 |     9 | 2.32e+001     | 9.6156e+004
                   | FISTA-p  | 3.85e+000 |     9 | 1.02e+001     | 9.6156e+004
                   | FISTA    | 2.92e+000 |    11 | 1.44e+001     | 9.6158e+004
                   | ProxGrad | 1.16e+002 |  4137 | -             | -
-------------------+----------+-----------+-------+---------------+------------
ogl-5000-200-10-3  | ADAL     | 5.04e+000 |    63 | 1.00e+000     | 4.1573e+005
                   | APLM-S   | 8.42e+000 |     8 | 8.38e+000     | 4.1576e+005
                   | FISTA-p  | 3.96e+000 |     9 | 6.56e+000     | 4.1572e+005
                   | FISTA    | 6.54e+000 |    10 | 9.70e+000     | 4.1573e+005
                   | ProxGrad | 1.68e+002 |  4345 | -             | -
-------------------+----------+-----------+-------+---------------+------------
ogl-10000-200-10-3 | ADAL     | 6.41e+000 |    44 | 1.00e+000     | 1.0026e+006
                   | APLM-S   | 1.46e+001 |    10 | 7.60e+000     | 1.0026e+006
                   | FISTA-p  | 5.60e+000 |    10 | 5.50e+000     | 1.0026e+006
                   | FISTA    | 1.09e+001 |    10 | 8.50e+000     | 1.0027e+006
                   | ProxGrad | 3.31e+002 |  6186 | -             | -

Table 5: Numerical results for ogl set 2. We ran the test for ten problems with n = 1000, ..., 10000,
but show the results for only three of them to save space.

Data Sets            | Algs     | CPU (s)   | Iters | Avg Sub-iters | F(x)
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-5000-1  | ADAL     | 1.14e+001 |   194 | 1.00e+000     | 8.4892e+002
                     | FISTA-p  | 1.21e+001 |    20 | 1.11e+001     | 8.4892e+002
                     | FISTA    | 2.49e+001 |    24 | 2.51e+001     | 8.4893e+002
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-10000-1 | ADAL     | 3.31e+001 |   398 | 1.00e+000     | 1.4887e+003
                     | FISTA-p  | 2.54e+001 |    41 | 5.61e+000     | 1.4887e+003
                     | FISTA    | 6.33e+001 |    44 | 1.74e+001     | 1.4887e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-15000-1 | ADAL     | 6.09e+001 |   515 | 1.00e+000     | 2.7506e+003
                     | FISTA-p  | 3.95e+001 |    52 | 4.44e+000     | 2.7506e+003
                     | FISTA    | 9.73e+001 |    54 | 1.32e+001     | 2.7506e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-20000-1 | ADAL     | 9.52e+001 |   626 | 1.00e+000     | 3.3415e+003
                     | FISTA-p  | 6.66e+001 |    63 | 6.10e+000     | 3.3415e+003
                     | FISTA    | 1.81e+002 |    64 | 1.61e+001     | 3.3415e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-25000-1 | ADAL     | 1.54e+002 |   882 | 1.00e+000     | 4.1987e+003
                     | FISTA-p  | 7.50e+001 |    88 | 3.20e+000     | 4.1987e+003
                     | FISTA    | 1.76e+002 |    89 | 8.64e+000     | 4.1987e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-30000-1 | ADAL     | 1.87e+002 |   957 | 1.00e+000     | 4.6111e+003
                     | FISTA-p  | 8.79e+001 |    96 | 2.86e+000     | 4.6111e+003
                     | FISTA    | 2.24e+002 |    94 | 8.54e+000     | 4.6111e+003

Table 6: Numerical results for dct set 2 (scalability test) with l1/l2-regularization. All three
algorithms were run in factorization mode with a fixed μ = μ_0.

Data Sets            | Algs     | CPU (s)   | Iters | Avg Sub-iters | F(x)
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-5000-1  | ADAL     | 1.53e+001 |   266 | 1.00e+000     | 7.3218e+002
                     | FISTA-p  | 1.61e+001 |    10 | 3.05e+001     | 7.3219e+002
                     | FISTA    | 3.02e+001 |    16 | 4.09e+001     | 7.3233e+002
                     | ProxFlow | 1.97e+001 |     - | -             | 7.3236e+002
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-10000-1 | ADAL     | 3.30e+001 |   330 | 1.00e+000     | 1.2707e+003
                     | FISTA-p  | 3.16e+001 |    10 | 3.10e+001     | 1.2708e+003
                     | FISTA    | 7.27e+001 |    24 | 3.25e+001     | 1.2708e+003
                     | ProxFlow | 3.67e+001 |     - | -             | 1.2709e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-15000-1 | ADAL     | 4.83e+001 |   328 | 1.00e+000     | 2.2444e+003
                     | FISTA-p  | 5.40e+001 |    15 | 2.52e+001     | 2.2444e+003
                     | FISTA    | 8.64e+001 |    23 | 2.66e+001     | 2.2449e+003
                     | ProxFlow | 9.91e+001 |     - | -             | 2.2467e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-20000-1 | ADAL     | 8.09e+001 |   463 | 1.00e+000     | 2.6340e+003
                     | FISTA-p  | 8.09e+001 |    16 | 2.88e+001     | 2.6340e+003
                     | FISTA    | 1.48e+002 |    26 | 2.93e+001     | 2.6342e+003
                     | ProxFlow | 2.55e+002 |     - | -             | 2.6357e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-25000-1 | ADAL     | 7.48e+001 |   309 | 1.00e+000     | 3.5566e+003
                     | FISTA-p  | 1.15e+002 |    30 | 1.83e+001     | 3.5566e+003
                     | FISTA    | 2.09e+002 |    38 | 2.30e+001     | 3.5568e+003
                     | ProxFlow | 1.38e+002 |     - | -             | 3.5571e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-30000-1 | ADAL     | 9.99e+001 |   359 | 1.00e+000     | 3.7057e+003
                     | FISTA-p  | 1.55e+002 |    29 | 2.17e+001     | 3.7057e+003
                     | FISTA    | 2.60e+002 |    39 | 2.25e+001     | 3.7060e+003
                     | ProxFlow | 1.07e+002 |     - | -             | 3.7063e+003

Table 7: Numerical results for dct set 2 (scalability test) with l1/l∞-regularization. The algorithm
configurations are exactly the same as in Table 6.

Data Sets            | Algs     | CPU (s)   | Iters | Avg Sub-iters | F(x)
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-5000-1  | FISTA-p  | 1.83e+001 |    12 | 2.34e+001     | 8.4892e+002
                     | FISTA    | 2.49e+001 |    24 | 2.51e+001     | 8.4893e+002
                     | ADAL     | 1.35e+001 |   181 | 1.00e+000     | 8.4892e+002
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-10000-1 | FISTA-p  | 3.16e+001 |    14 | 1.73e+001     | 1.4887e+003
                     | FISTA    | 6.33e+001 |    44 | 1.74e+001     | 1.4887e+003
                     | ADAL     | 4.43e+001 |   270 | 1.00e+000     | 1.4887e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-15000-1 | FISTA-p  | 4.29e+001 |    14 | 1.51e+001     | 2.7506e+003
                     | FISTA    | 9.73e+001 |    54 | 1.32e+001     | 2.7506e+003
                     | ADAL     | 5.37e+001 |   216 | 1.00e+000     | 2.7506e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-20000-1 | FISTA-p  | 7.53e+001 |    13 | 2.06e+001     | 3.3416e+003
                     | FISTA    | 1.81e+002 |    64 | 1.61e+001     | 3.3415e+003
                     | ADAL     | 1.57e+002 |   390 | 1.00e+000     | 3.3415e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-25000-1 | FISTA-p  | 7.41e+001 |    15 | 1.47e+001     | 4.1987e+003
                     | FISTA    | 1.76e+002 |    89 | 8.64e+000     | 4.1987e+003
                     | ADAL     | 8.79e+001 |   231 | 1.00e+000     | 4.1987e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-30000-1 | FISTA-p  | 8.95e+001 |    14 | 1.58e+001     | 4.6111e+003
                     | FISTA    | 2.24e+002 |    94 | 8.54e+000     | 4.6111e+003
                     | ADAL     | 1.12e+002 |   249 | 1.00e+000     | 4.6111e+003

Table 8: Numerical results for the DCT set with l1/l2-regularization. FISTA-p and ADAL were run
in PCG mode with the dynamic scheme for updating μ. For FISTA, μ was fixed at μ_0.


Data Sets            | Algs     | CPU (s)   | Iters | Avg Sub-iters | F(x)
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-5000-1  | FISTA-p  | 2.30e+001 |    11 | 2.93e+001     | 7.3219e+002
                     | ADAL     | 1.89e+001 |   265 | 1.00e+000     | 7.3218e+002
                     | FISTA    | 3.02e+001 |    16 | 4.09e+001     | 7.3233e+002
                     | ProxFlow | 1.97e+001 |     - | -             | 7.3236e+002
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-10000-1 | FISTA-p  | 5.09e+001 |    11 | 3.16e+001     | 1.2708e+003
                     | ADAL     | 4.77e+001 |   323 | 1.00e+000     | 1.2708e+003
                     | FISTA    | 7.27e+001 |    24 | 3.25e+001     | 1.2708e+003
                     | ProxFlow | 3.67e+001 |     - | -             | 1.2709e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-15000-1 | FISTA-p  | 6.33e+001 |    12 | 2.48e+001     | 2.2445e+003
                     | ADAL     | 9.41e+001 |   333 | 1.00e+000     | 2.2444e+003
                     | FISTA    | 8.64e+001 |    23 | 2.66e+001     | 2.2449e+003
                     | ProxFlow | 9.91e+001 |     - | -             | 2.2467e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-20000-1 | FISTA-p  | 8.21e+001 |    12 | 2.42e+001     | 2.6341e+003
                     | ADAL     | 1.59e+002 |   415 | 1.00e+000     | 2.6340e+003
                     | FISTA    | 1.48e+002 |    26 | 2.93e+001     | 2.6342e+003
                     | ProxFlow | 2.55e+002 |     - | -             | 2.6357e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-25000-1 | FISTA-p  | 1.43e+002 |    13 | 2.98e+001     | 3.5567e+003
                     | ADAL     | 1.20e+002 |   310 | 1.00e+000     | 3.5566e+003
                     | FISTA    | 2.09e+002 |    38 | 2.30e+001     | 3.5568e+003
                     | ProxFlow | 1.38e+002 |     - | -             | 3.5571e+003
---------------------+----------+-----------+-------+---------------+------------
ogl-dct-1000-30000-1 | FISTA-p  | 1.75e+002 |    13 | 3.18e+001     | 3.7057e+003
                     | ADAL     | 2.01e+002 |   361 | 1.00e+000     | 3.7057e+003
                     | FISTA    | 2.60e+002 |    39 | 2.25e+001     | 3.7060e+003
                     | ProxFlow | 1.07e+002 |     - | -             | 3.7063e+003

Table 9: Numerical results for the DCT set with l1/l∞-regularization. FISTA-p and ADAL were run
in PCG mode. The dynamic updating scheme for μ was applied to FISTA-p, while μ was
fixed at μ_0 for ADAL and FISTA.

Data Sets        | Algs     | CPU (s)   | Iters | Avg Sub-iters | F(x)
-----------------+----------+-----------+-------+---------------+------------
BreastCancerData | ADAL     | 6.24e+000 |   136 | 1.00e+000     | 2.9331e+003
                 | APLM-S   | 4.02e+001 |    12 | 4.55e+001     | 2.9331e+003
                 | FISTA-p  | 6.86e+000 |    12 | 1.48e+001     | 2.9331e+003
                 | FISTA    | 5.11e+001 |    75 | 1.29e+001     | 2.9340e+003
                 | ProxGrad | 7.76e+002 |  6605 | 1.00e+000     | -

Table 10: Numerical results for Breast Cancer Data using l1/l2-regularization. In this experiment,
we kept μ constant at 0.01 for ADAL. The CPU time is for a single run on the entire data
set with the value of λ selected to minimize the RMSE in Figure 4.